Neural Surface Realization for Italian

Valerio Basile, Alessandro Mazzei
Dipartimento di Informatica, Università degli Studi di Torino
Corso Svizzera 185, 10153 Torino
basile@di.unito.it, mazzei@di.unito.it

Abstract

We present an architecture based on neural networks to generate natural language from unordered dependency trees. The task is split into the two subproblems of word order prediction and morphology inflection. We test our model on a gold corpus (the Italian portion of the Universal Dependency treebanks) and on an automatically parsed corpus from the Web.

(Italian) This work introduces an architecture based on neural networks to generate natural language sentences from dependency trees. The process is divided into the two subproblems of word ordering and morphological inflection, for which our architecture provides two independent models whose results are combined in the final phase. We tested the model using a gold corpus and a silver corpus obtained from the Web.

1 Introduction

Natural Language Generation is the process of producing natural language utterances from an abstract representation of knowledge. As opposed to Natural Language Understanding, where the input is well defined (typically a text or speech segment) and the output may vary in terms of complexity and scope of the analysis, in the generation process the input can take different forms and levels of abstraction, depending on the specific goals and applicative scenarios. However, the input structures for generation should be at least formally defined.

In this work we focus on the final part of the standard NLG pipeline defined by Reiter and Dale (2000), that is, surface realization: the task of producing natural language from formal abstract representations of a sentence's meaning and syntax.

We consider the surface realization of unordered Universal Dependency (UD) trees, i.e., syntactic structures where the words of a sentence are connected by labeled directed arcs in a tree-like fashion. The labels on the arcs indicate the syntactic relation holding between each word and its dependent words (Figure 1a). We approach the surface realization task in a supervised statistical setting. In particular, we draw inspiration from Basile (2015) by dividing the task into the two independent subtasks of word order prediction and morphology inflection prediction. Two neural network-based models run in parallel on the same input structure, and their output is later combined to produce the final surface form.

A first version of the system implementing our proposed architecture (called the DipInfo-UniTo realizer) was submitted to the shallow track of the Surface Realization Shared Task 2018 (Mille et al., 2018). The main research goal of this paper is to provide a critical analysis for tuning the training data and the learning parameters of the DipInfo-UniTo realizer.

2 Neural network-based Surface Realization

In the following sections, we detail the two neural networks employed to solve the subtasks of word order prediction (Section 2.1) and morphology inflection (Section 2.2), respectively.

2.1 Word Ordering

We reformulate the problem of ordering the words of a sentence in terms of reordering the subtrees of its syntactic structure. The algorithm is composed of three steps: i) splitting the unordered tree into single-level unordered subtrees; ii) predicting the local word order for each subtree; iii) recomposing the single-level ordered subtrees into a single multi-level ordered tree to obtain the global word order.

In the first step, we split the original unordered universal dependency multi-level tree into a number of single-level unordered trees, where each subtree is composed of a head (the root) and all of its dependents (the children), similarly to Bohnet et al. (2012). An example is shown in Figure 1: from the (unordered) tree representing the sentence "Numerose sue opere contengono prodotti chimici tossici." (1a), each of its component subtrees (limited to one level of dependency) is considered separately (1b). The head and the dependents of each subtree form an unordered list of lexical items. Crucially, we leverage the flat structure of the subtrees in order to extract structures that are suitable as input to the learning to rank algorithm in the next step of the process.

Figure 1: Splitting the input tree into subtrees to extract lists of items for learning to rank. (a) Tree corresponding to the Italian sentence "Numerose sue opere contengono prodotti chimici tossici." ("Many of his works contain toxic chemicals."). (b) Three subtrees extracted from the main tree: "contenere" with dependents "opera", "prodotto" and "."; "opera" with dependents "suo" and "numeroso"; "prodotto" with dependents "chimico" and "tossico".
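To make the first step concrete, the following minimal sketch shows one way to split an unordered dependency tree into single-level subtrees, each consisting of a head and its dependents. The token representation as plain dictionaries with CoNLL-U-like fields ('id', 'lemma', 'head') is an illustrative assumption and does not reflect the actual data structures of the DipInfo-UniTo code.

```python
from collections import defaultdict

def split_into_subtrees(tokens):
    """Group an unordered UD tree into single-level subtrees.

    `tokens` is assumed to be a list of dicts with at least the keys
    'id', 'lemma' and 'head' (0 = root), mirroring CoNLL-U fields.
    Returns a list of (head_token, [dependent_tokens]) pairs, one per
    internal node of the tree.
    """
    by_id = {t['id']: t for t in tokens}
    children = defaultdict(list)
    for t in tokens:
        children[t['head']].append(t)
    # One single-level subtree per token that has at least one dependent
    return [(by_id[h], deps) for h, deps in children.items() if h != 0]

# Toy example for "Numerose sue opere contengono prodotti chimici tossici."
toy = [
    {'id': 1, 'lemma': 'numeroso', 'head': 3},
    {'id': 2, 'lemma': 'suo', 'head': 3},
    {'id': 3, 'lemma': 'opera', 'head': 4},
    {'id': 4, 'lemma': 'contenere', 'head': 0},
    {'id': 5, 'lemma': 'prodotto', 'head': 4},
    {'id': 6, 'lemma': 'chimico', 'head': 5},
    {'id': 7, 'lemma': 'tossico', 'head': 5},
    {'id': 8, 'lemma': '.', 'head': 4},
]
for head, deps in split_into_subtrees(toy):
    print(head['lemma'], '->', [d['lemma'] for d in deps])
```

Applied to the tree of Figure 1a, this yields the three head-plus-dependents lists shown in Figure 1b, which are then passed as unordered lists of lexical items to the learning to rank step.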
In the second step of the algorithm, we predict the relative order of the head and the dependents of each subtree with a learning to rank approach. We employ the list-wise learning to rank algorithm ListNet, proposed by Cao et al. (2007). The relatively small size of the lists of items to rank allows us to use a list-wise approach, as opposed to pair-wise or point-wise approaches, while keeping the computation times manageable. ListNet uses a list-wise loss function based on the top one probability, i.e., the probability of an element being the first one in the ranking. The top one probability model approximates the permutation probability model, which assigns a probability to each possible permutation of an ordered list. This approximation is necessary to keep the problem tractable by avoiding the exponential explosion of the number of permutations. Formally, the top one probability of an object j is defined as

P_s(j) = \sum_{\pi \in \Omega_n,\ \pi(1) = j} P_s(\pi)

that is, the sum of the probabilities of all the possible permutations of n objects (denoted as \Omega_n) in which j is the first element, where s = (s_1, ..., s_n) is a given list of scores, one per element of the list. Considering two permutations of the same list, y and z (for instance, the predicted order and the reference order), their distance is computed using cross entropy. The distance measure and the top one probabilities of the list elements are used in the loss function:

L(y, z) = -\sum_{j=1}^{n} P_y(j) \log P_z(j)

This list-wise loss function is plugged into a linear neural network model to provide a learning environment. ListNet takes as input a sequence of ordered lists of feature vectors (the features are encoded as numeric vectors). The weights of the network are iteratively adjusted by computing the list-wise cost function, which measures the distance between the reference ranking and the prediction of the model, and passing its value to the gradient descent algorithm for the optimization of the parameters.
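As a concrete illustration of the loss just described, the sketch below computes the top one probabilities as a softmax over per-item scores and the cross entropy between the reference and predicted distributions, following Cao et al. (2007). It is a minimal NumPy illustration of the formulas above, not the training code of our ListNet implementation.

```python
import numpy as np

def top_one_probability(scores):
    """Top one probability of each item given real-valued scores.

    Under the ListNet approximation this is a softmax over the scores:
    P_s(j) is the probability that item j is ranked first.
    """
    exp = np.exp(scores - np.max(scores))  # subtract max for numerical stability
    return exp / exp.sum()

def listnet_loss(reference_scores, predicted_scores):
    """Cross entropy between the two top one probability distributions.

    Corresponds to L(y, z) = -sum_j P_y(j) log P_z(j) in the text.
    """
    p_y = top_one_probability(np.asarray(reference_scores, dtype=float))
    p_z = top_one_probability(np.asarray(predicted_scores, dtype=float))
    return -np.sum(p_y * np.log(p_z))

# Example: reference scores encode the true word order of a subtree,
# predicted scores come from the linear network.
print(listnet_loss([3.0, 2.0, 1.0], [2.5, 2.4, 0.5]))
```

In the full model, the predicted scores are produced by the linear network from the feature vectors described below, and the gradient of this loss drives the weight updates.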
The choice of features for the supervised learning to rank component is a critical point of our solution. We use several word-level features encoded as one-hot vectors, namely: the universal POS tag, the treebank-specific POS tag, the morphological features, and the head status of the word (head of the single-level tree vs. leaf). Furthermore, we include word representations, differentiating between content words and function words: for open-class word lemmas (content words) we add the corresponding language-specific word embedding to the feature vector, taken from the pre-trained multilingual model Polyglot (Al-Rfou' et al., 2013). Closed-class word lemmas (function words) are encoded as one-hot bag-of-words vectors. An implementation of the feature encoding for the word ordering module of our architecture is available online at https://github.com/alexmazzei/ud2ln.

In the third step of the word ordering algorithm, we reconstruct the global (i.e., sentence-level) order from the local order of the one-level trees under the hypothesis of projectivity; as a consequence of this design choice, the DipInfo-UniTo realizer cannot predict the correct word order for non-projective sentences. See Basile and Mazzei (2018) for details on this step.

2.2 Morphology Inflection

The second component of our architecture is responsible for the morphology inflection. The task is formulated as an alignment problem between characters that can be modeled with the sequence-to-sequence paradigm. We use a deep neural network architecture based on a hard attention mechanism, recently introduced by Aharoni and Goldberg (2017). The model consists of a neural network in an encoder-decoder setting. At each step of training, the model can either write a symbol to the output sequence or move the attention pointer to the next state of the input sequence. This mechanism is meant to model the natural monotonic alignment between the input and output sequences, while allowing the freedom to condition the output on the entire input sequence.

We employ all the morphological features provided by the UD annotation and the dependency relation binding the word to its head. That is, we transform the training files into a set of structures ((lemma, features), form) in order to learn a neural inflectional model associating each (lemma, features) pair with the corresponding form. An example of a training instance for our morphology inflection module is the following:

  lemma: artificiale
  features:
    uPoS=ADJ
    xPoS=A
    rel=amod
    Number=Plur
  form: artificiali

corresponding to the word form artificiali, an inflected (plural) form of the lemma artificiale ("artificial").
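To make the training data format concrete, the following sketch shows one plausible way to derive a ((lemma, features), form) pair from a UD token. The dictionary-based token representation and the feature serialization order are illustrative assumptions, not the exact format expected by the hard-attention reinflection code.

```python
def to_inflection_instance(token):
    """Build a ((lemma, features), form) training pair from a UD token.

    `token` is assumed to be a dict with CoNLL-U-like fields: 'lemma',
    'form', 'upos', 'xpos', 'deprel' and 'feats' (a dict of
    morphological features such as {'Number': 'Plur'}).
    """
    features = ['uPoS=' + token['upos'],
                'xPoS=' + token['xpos'],
                'rel=' + token['deprel']]
    features += [k + '=' + v for k, v in sorted(token['feats'].items())]
    return (token['lemma'], tuple(features)), token['form']

# The example from the text: plural form of the adjective "artificiale"
token = {'lemma': 'artificiale', 'form': 'artificiali', 'upos': 'ADJ',
         'xpos': 'A', 'deprel': 'amod', 'feats': {'Number': 'Plur'}}
print(to_inflection_instance(token))
# (('artificiale', ('uPoS=ADJ', 'xPoS=A', 'rel=amod', 'Number=Plur')), 'artificiali')
```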
3 Evaluation

In this section, we present an evaluation of the models presented in Section 2, with particular consideration for two crucial points influencing the performance of the DipInfo-UniTo realizer, namely the training data and the learning parameter settings. In Basile and Mazzei (2018), hardware limitations did not allow for extensive experimentation dedicated to optimizing the realizer's performance. In this paper, we aim to bridge this gap by experimenting with higher computing capabilities, specifically a virtualized GNU/Linux box with 16 cores and 64GB of RAM.

3.1 Training Data

For our experiments, we used the four Italian corpora annotated with Universal Dependencies available in the Universal Dependency repositories (http://universaldependencies.org/). In total, they comprise 270,703 tokens and 12,838 sentences. We previously used this corpus to train the DipInfo-UniTo realizer that participated in the SRST18 competition (Basile and Mazzei, 2018). We refer to this corpus as Gold-SRST18 henceforth.

Moreover, we used a larger corpus extracted from ItWaC, a large unannotated corpus of Italian (Baroni et al., 2009). We parsed ItWaC with UDPipe (Straka and Straková, 2017) and selected a random sample of 9,427 sentences (274,115 tokens). We refer to this corpus as Silver-WaC henceforth.

3.2 Word Ordering Performances

We trained the word order prediction module of our system on the Gold-SRST18 corpus as well as on the larger corpus created by concatenating Gold-SRST18 and Silver-WaC. Our implementation of ListNet, featuring a regularization parameter to prevent overfitting, is available at https://github.com/valeriobasile/listnet.

The performance of the ListNet algorithm for word ordering is given in terms of the average Kendall's Tau (Kendall, 1938), denoted τ, a measure of rank correlation used to score each of the rankings predicted by our model for every subtree (Figure 2). τ measures the similarity between two rankings by counting how many pairs of elements are swapped with respect to the original ordering, out of all possible pairs of n elements:

\tau = \frac{\#\text{concordant pairs} - \#\text{discordant pairs}}{\frac{1}{2}\, n(n-1)}

Therefore, τ ranges from -1 to 1.
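For reference, the sketch below is a direct, naive implementation of the τ formula above for two orderings of the same items; it illustrates the metric itself and is not the evaluation script used to produce Figure 2.

```python
from itertools import combinations

def kendall_tau(reference, predicted):
    """Kendall's tau between two orderings of the same items.

    Each argument is a sequence of hashable items; a pair is concordant
    if the two orderings agree on its relative order, discordant otherwise.
    """
    n = len(reference)
    pos_ref = {item: i for i, item in enumerate(reference)}
    pos_pred = {item: i for i, item in enumerate(predicted)}
    concordant = discordant = 0
    for a, b in combinations(reference, 2):
        if (pos_ref[a] - pos_ref[b]) * (pos_pred[a] - pos_pred[b]) > 0:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# A single swapped pair out of three items lowers tau from 1.0 to 0.33
print(kendall_tau(['numeroso', 'suo', 'opera'], ['suo', 'numeroso', 'opera']))
```

A value of 1 means the predicted ordering is identical to the reference ordering, while -1 means it is completely reversed.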
In Figure 2 we report the τ values obtained at various epochs of learning for both the Gold-SRST18 and the Gold-SRST18+Silver-WaC corpora. In particular, in order to investigate the influence of the learning rate parameter (LR) on the learning of the ListNet model, we report the τ trends for LR = 5 · 10^-5 (the value originally used for the official SRST18 submission), LR = 5 · 10^-6 and LR = 5 · 10^-7. It is quite clear that the value of LR has a great impact on word ordering performance, and that LR = 5 · 10^-5 is not appropriate to reach the best performance. This explains the poor performance of the DipInfo-UniTo realizer in the SRST18 competition (Table 1). Indeed, the typical zigzag shape of the curve suggests a sort of loop in the gradient descent algorithm. In contrast, LR = 5 · 10^-6 seems to reach a plateau after the 100th epoch with both corpora used in the experiments. We used the system tuned with this value of the learning rate to evaluate the global performance of the realizer.

Figure 2: The trend of the τ value with respect to the ListNet training epochs, for the Gold-SRST18 training set (LR = 0.00005, 0.000005 and 0.0000005) and for the Gold-SRST18+Silver-WaC training set (LR = 0.000005).

3.3 Morphology Inflection Performances

In order to understand the impact of the Silver-WaC corpus on the global performance of the system, we trained the DNN system for morphology inflection both on the Gold-SRST18 corpus and on the larger corpus composed of Gold-SRST18 + Silver-WaC. An implementation of the model by Aharoni and Goldberg (2017) is freely available at https://github.com/roeeaharoni/morphological-reinflection. In Figure 3 we report the accuracy on the SRST18 development set for both corpora. A first analysis of the trend shows little improvement to the global performance of the realization from the inclusion of additional data (see the discussion in the next section).

Figure 3: The trend of the morphology accuracy on the SRST18 development set with respect to the DNN training epochs, for the Gold-SRST18 and Gold-SRST18+Silver-WaC training sets.

3.4 Global Surface Realization Performances

Finally, we evaluate the end-to-end performance of our systems by combining the output of the two modules and submitting it to the evaluation scorer of the Surface Realization Shared Task. In Table 1 we report the performance of the various tested systems with respect to the BLEU-4, DIST and NIST measures, as defined by Mille et al. (2018). The first line reports the official performance of the DipInfo-UniTo realizer in the SRST18 for Italian. The last line reports the best performances achieved on Italian by the participants to SRST18 (Mille et al., 2018). The other lines report the performance of the DipInfo-UniTo realizer obtained by considering various combinations of the gold and silver corpora.

  ListNet   Morpho   BLEU-4   DIST    NIST
  Gsrst     Gsrst    24.61    36.11   8.25
  G         G        36.40    32.80   9.27
  G         G+S      36.60    32.70   9.30
  G+S       G        36.40    32.80   9.27
  G+S       G+S      36.60    32.70   9.30
  -         -        44.16    58.61   9.11

Table 1: The performances of the systems with respect to the BLEU-4, DIST and NIST measures.

The results show a clear improvement for the word order module (note that the DIST metric is character-based, and is therefore more sensitive to morphological variation than NIST and BLEU-4). In contrast, the performance of the morphology submodule seems to be unaffected by the use of a larger training corpus. This effect could be due to different causes. Errors are present in the silver standard training set, and it is not clear to what extent the morphological analysis is correct with respect to the syntactic analysis. The other possible cause is the neural model itself: indeed, Aharoni and Goldberg (2017) report a plateau in performance even when feeding the model relatively small datasets. The DipInfo-UniTo realizer performs better than the best systems of the SRST18 challenge on one out of three metrics (NIST).

4 Conclusion and Future Work

In this paper, we considered the problem of analysing the impact of the training data and of parameter tuning on the (modular and global) performance of the DipInfo-UniTo realizer. We showed experimentally that the DipInfo-UniTo realizer can give competitive results (i) by augmenting the training data set with automatically annotated sentences, and (ii) by tuning the learning parameters of the neural models.

In future work, we intend to address the main limitation of our approach, that is, the impossibility of realizing non-projective sentences. Moreover, further optimization of both neural models will be carried out on a new high-performance architecture (Aldinucci et al., 2018), by executing a systematic grid search over the hyperparameter space, namely the regularization factor and weight initialization for ListNet, and the specific DNN hyperparameters for the morphology module.

Acknowledgments

We thank the GARR consortium, which kindly allowed us to use the GARR Cloud Platform (https://cloud.garr.it) to run some of the experiments described in this paper. Valerio Basile was partially funded by Progetto di Ateneo/CSP 2016 (Immigrants, Hate and Prejudice in Social Media, S1618 L2 BOSC 01). Alessandro Mazzei was partially supported by the HPC4AI project, funded by the Region Piedmont POR-FESR 2014-20 programme (INFRA-P call).
References

Roee Aharoni and Yoav Goldberg. 2017. Morphological inflection generation with hard monotonic attention. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL 2017), pages 2004–2015.

Rami Al-Rfou', Bryan Perozzi, and Steven Skiena. 2013. Polyglot: Distributed word representations for multilingual NLP. In Proceedings of CoNLL, pages 183–192. ACL.

Marco Aldinucci, Sergio Rabellino, Marco Pironti, Filippo Spiga, Paolo Viviani, Maurizio Drocco, Marco Guerzoni, Guido Boella, Marco Mellia, Paolo Margara, Idillio Drago, Roberto Marturano, Guido Marchetto, Elio Piccolo, Stefano Bagnasco, Stefano Lusso, Sara Vallero, Giuseppe Attardi, Alex Barchiesi, Alberto Colla, and Fulvio Galeazzi. 2018. HPC4AI, an AI-on-demand federated platform endeavour. In ACM Computing Frontiers, Ischia, Italy, May.

Marco Baroni, Silvia Bernardini, Adriano Ferraresi, and Eros Zanchetta. 2009. The WaCky wide web: A collection of very large linguistically processed web-crawled corpora. Language Resources and Evaluation, 43(3):209–226, September.

Valerio Basile and Alessandro Mazzei. 2018. The DipInfo-UniTo system for SRST 2018. In Proceedings of the First Workshop on Multilingual Surface Realisation, pages 65–71. Association for Computational Linguistics.

Valerio Basile. 2015. From Logic to Language: Natural Language Generation from Logical Forms. Ph.D. thesis, University of Groningen, Netherlands.

Bernd Bohnet, Anders Björkelund, Jonas Kuhn, Wolfgang Seeker, and Sina Zarrieß. 2012. Generating non-projective word order in statistical linearization. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 928–939. Association for Computational Linguistics.

Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li. 2007. Learning to rank: From pairwise approach to listwise approach. In Proceedings of the 24th International Conference on Machine Learning (ICML '07), pages 129–136, New York, NY, USA. ACM.

M. G. Kendall. 1938. A new measure of rank correlation. Biometrika, 30(1/2):81–93.

Simon Mille, Anja Belz, Bernd Bohnet, Yvette Graham, Emily Pitler, and Leo Wanner. 2018. The first multilingual surface realisation shared task (SR'18): Overview and evaluation results. In Proceedings of the First Workshop on Multilingual Surface Realisation, pages 1–12. Association for Computational Linguistics.

Ehud Reiter and Robert Dale. 2000. Building Natural Language Generation Systems. Cambridge University Press, New York, NY, USA.

Milan Straka and Jana Straková. 2017. Tokenizing, POS tagging, lemmatizing and parsing UD 2.0 with UDPipe. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 88–99, Vancouver, Canada, August. Association for Computational Linguistics.