Neural Surface Realization for Italian

Valerio Basile, Alessandro Mazzei
Dipartimento di Informatica, Università degli Studi di Torino
Corso Svizzera 185, 10153 Torino
basile@di.unito.it, mazzei@di.unito.it

Abstract

We present an architecture based on neural networks to generate natural language from unordered dependency trees. The task is split into the two subproblems of word order prediction and morphology inflection. We test our model on a gold corpus (the Italian portion of the Universal Dependency treebanks) and on an automatically parsed corpus from the Web.

(Italian) This work introduces an architecture based on neural networks to generate natural language sentences from dependency trees. The process is divided into the two subproblems of word ordering and morphological inflection, for which our architecture provides two independent models whose results are combined in the final phase. We tested the model using a gold corpus and a silver corpus obtained from the Web.

1 Introduction

Natural Language Generation is the process of producing natural language utterances from an abstract representation of knowledge. As opposed to Natural Language Understanding, where the input is well defined (typically a text or speech segment) and the output may vary in terms of complexity and scope of the analysis, in the generation process the input can take different forms and levels of abstraction, depending on the specific goals and applicative scenarios. However, the input structures for generation should be at least formally defined.

In this work we focus on the final part of the standard NLG pipeline defined by Reiter and Dale (2000), that is, surface realization: the task of producing natural language from formal abstract representations of a sentence's meaning and syntax.

We consider the surface realization of unordered Universal Dependency (UD) trees, i.e., syntactic structures where the words of a sentence are connected by labeled directed arcs in a tree-like fashion. The labels on the arcs indicate the syntactic relation holding between each word and its dependent words (Figure 1a). We approach the surface realization task in a supervised statistical setting. In particular, we draw inspiration from Basile (2015) by dividing the task into the two independent subtasks of word order prediction and morphology inflection prediction. Two neural network-based models run in parallel on the same input structure, and their output is later combined to produce the final surface form.

A first version of the system implementing our proposed architecture (called the DipInfo-UniTo realizer) was submitted to the shallow track of the Surface Realization Shared Task 2018 (Mille et al., 2018). The main research goal of this paper is to provide a critical analysis for tuning the training data and the learning parameters of the DipInfo-UniTo realizer.

2 Neural network-based Surface Realization

In the following sections, we detail the two neural networks employed to solve the subtasks of word order prediction (Section 2.1) and morphology inflection (Section 2.2), respectively.

2.1 Word Ordering

We reformulate the problem of ordering the words of a sentence in terms of reordering the subtrees of its syntactic structure. The algorithm is composed of three steps: i) splitting the unordered tree into single-level unordered subtrees; ii) predicting the local word order for each subtree; iii) recomposing the single-level ordered subtrees into a single multi-level ordered tree to obtain the global word order.

In the first step, we split the original unordered universal dependency multi-level tree into a number of single-level unordered trees, where each subtree is composed of a head (the root) and all of its dependents (the children), similarly to Bohnet et al. (2012). An example is shown in Figure 1: from the (unordered) tree representing the sentence "Numerose sue opere contengono prodotti chimici tossici." (1a), each of its component subtrees (limited to one level of dependency) is considered separately (1b). The head and the dependents of each subtree form an unordered list of lexical items. Crucially, we leverage the flat structure of the subtrees in order to extract structures that are suitable as input to the learning to rank algorithm in the next step of the process.

Figure 1: Splitting the input tree into subtrees to extract lists of items for learning to rank. (a) Tree corresponding to the Italian sentence "Numerose sue opere contengono prodotti chimici tossici." ("Many of his works contain toxic chemicals."). (b) Three subtrees extracted from the main tree: "contenere" with dependents "opera", "prodotto" and "."; "opera" with dependents "suo" and "numeroso"; "prodotto" with dependents "chimico" and "tossico".
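To make the first step concrete, the following minimal sketch shows one way to split an unordered dependency tree into single-level subtrees, each consisting of a head and its dependents. The token representation as plain dictionaries with CoNLL-U-like fields ('id', 'lemma', 'head') is an illustrative assumption and does not reflect the actual data structures of the DipInfo-UniTo code.

```python
from collections import defaultdict

def split_into_subtrees(tokens):
    """Group an unordered UD tree into single-level subtrees.

    `tokens` is assumed to be a list of dicts with at least the keys
    'id', 'lemma' and 'head' (0 = root), mirroring CoNLL-U fields.
    Returns a list of (head_token, [dependent_tokens]) pairs, one per
    internal node of the tree.
    """
    by_id = {t['id']: t for t in tokens}
    children = defaultdict(list)
    for t in tokens:
        children[t['head']].append(t)
    # One single-level subtree per token that has at least one dependent
    return [(by_id[h], deps) for h, deps in children.items() if h != 0]

# Toy example for "Numerose sue opere contengono prodotti chimici tossici."
toy = [
    {'id': 1, 'lemma': 'numeroso', 'head': 3},
    {'id': 2, 'lemma': 'suo', 'head': 3},
    {'id': 3, 'lemma': 'opera', 'head': 4},
    {'id': 4, 'lemma': 'contenere', 'head': 0},
    {'id': 5, 'lemma': 'prodotto', 'head': 4},
    {'id': 6, 'lemma': 'chimico', 'head': 5},
    {'id': 7, 'lemma': 'tossico', 'head': 5},
    {'id': 8, 'lemma': '.', 'head': 4},
]
for head, deps in split_into_subtrees(toy):
    print(head['lemma'], '->', [d['lemma'] for d in deps])
```

Applied to the tree of Figure 1a, this yields the three head-plus-dependents lists shown in Figure 1b, which are then passed as unordered lists of lexical items to the learning to rank step.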
In the second step of the algorithm, we predict the relative order of the head and the dependents of each subtree with a learning to rank approach. We employ the list-wise learning to rank algorithm ListNet, proposed by Cao et al. (2007). The relatively small size of the lists of items to rank allows us to use a list-wise approach, as opposed to pair-wise or point-wise approaches, while keeping the computation times manageable. ListNet uses a list-wise loss function based on the top one probability, i.e., the probability of an element being the first one in the ranking. The top one probability model approximates the permutation probability model, which assigns a probability to each possible permutation of an ordered list. This approximation is necessary to keep the problem tractable by avoiding the exponential explosion of the number of permutations. Formally, the top one probability of an object j is defined as

P_s(j) = \sum_{\pi \in \Omega_n,\ \pi(1) = j} P_s(\pi)

that is, the sum of the probabilities of all the possible permutations of n objects (denoted as \Omega_n) in which j is the first element, where s = (s_1, ..., s_n) is a given list of scores, one per element of the list. Considering two permutations of the same list, y and z (for instance, the predicted order and the reference order), their distance is computed using cross entropy. The distance measure and the top one probabilities of the list elements are used in the loss function:

L(y, z) = -\sum_{j=1}^{n} P_y(j) \log P_z(j)

This list-wise loss function is plugged into a linear neural network model to provide a learning environment. ListNet takes as input a sequence of ordered lists of feature vectors (the features are encoded as numeric vectors). The weights of the network are iteratively adjusted by computing the list-wise cost function, which measures the distance between the reference ranking and the prediction of the model, and passing its value to the gradient descent algorithm for the optimization of the parameters.
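As a concrete illustration of the loss just described, the sketch below computes the top one probabilities as a softmax over per-item scores and the cross entropy between the reference and predicted distributions, following Cao et al. (2007). It is a minimal NumPy illustration of the formulas above, not the training code of our ListNet implementation.

```python
import numpy as np

def top_one_probability(scores):
    """Top one probability of each item given real-valued scores.

    Under the ListNet approximation this is a softmax over the scores:
    P_s(j) is the probability that item j is ranked first.
    """
    exp = np.exp(scores - np.max(scores))  # subtract max for numerical stability
    return exp / exp.sum()

def listnet_loss(reference_scores, predicted_scores):
    """Cross entropy between the two top one probability distributions.

    Corresponds to L(y, z) = -sum_j P_y(j) log P_z(j) in the text.
    """
    p_y = top_one_probability(np.asarray(reference_scores, dtype=float))
    p_z = top_one_probability(np.asarray(predicted_scores, dtype=float))
    return -np.sum(p_y * np.log(p_z))

# Example: reference scores encode the true word order of a subtree,
# predicted scores come from the linear network.
print(listnet_loss([3.0, 2.0, 1.0], [2.5, 2.4, 0.5]))
```

In the full model, the predicted scores are produced by the linear network from the feature vectors described below, and the gradient of this loss drives the weight updates.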
The choice of features for the supervised learning to rank component is a critical point of our solution. We use several word-level features encoded as one-hot vectors, namely: the universal POS tag, the treebank-specific POS tag, the morphological features, and the head status of the word (head of the single-level tree vs. leaf). Furthermore, we include word representations, differentiating between content words and function words: for open-class word lemmas (content words) we add the corresponding language-specific word embedding to the feature vector, taken from the pre-trained multilingual model Polyglot (Al-Rfou' et al., 2013). Closed-class word lemmas (function words) are encoded as one-hot bag-of-words vectors. An implementation of the feature encoding for the word ordering module of our architecture is available online at https://github.com/alexmazzei/ud2ln.

In the third step of the word ordering algorithm, we reconstruct the global (i.e., sentence-level) order from the local order of the one-level trees under the hypothesis of projectivity; as a consequence of this design choice, the DipInfo-UniTo realizer cannot predict the correct word order for non-projective sentences. See Basile and Mazzei (2018) for details on this step.

2.2 Morphology Inflection

The second component of our architecture is responsible for the morphology inflection. The task is formulated as an alignment problem between characters that can be modeled with the sequence-to-sequence paradigm. We use a deep neural network architecture based on a hard attention mechanism, recently introduced by Aharoni and Goldberg (2017). The model consists of a neural network in an encoder-decoder setting. At each step of training, the model can either write a symbol to the output sequence or move the attention pointer to the next state of the input sequence. This mechanism is meant to model the natural monotonic alignment between the input and output sequences, while allowing the freedom to condition the output on the entire input sequence.

We employ all the morphological features provided by the UD annotation and the dependency relation binding the word to its head. That is, we transform the training files into a set of structures ((lemma, features), form) in order to learn a neural inflectional model associating each (lemma, features) pair with the corresponding form. An example of a training instance for our morphology inflection module is the following:

  lemma: artificiale
  features:
    uPoS=ADJ
    xPoS=A
    rel=amod
    Number=Plur
  form: artificiali

corresponding to the word form artificiali, an inflected (plural) form of the lemma artificiale ("artificial").
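To make the training data format concrete, the following sketch shows one plausible way to derive a ((lemma, features), form) pair from a UD token. The dictionary-based token representation and the feature serialization order are illustrative assumptions, not the exact format expected by the hard-attention reinflection code.

```python
def to_inflection_instance(token):
    """Build a ((lemma, features), form) training pair from a UD token.

    `token` is assumed to be a dict with CoNLL-U-like fields: 'lemma',
    'form', 'upos', 'xpos', 'deprel' and 'feats' (a dict of
    morphological features such as {'Number': 'Plur'}).
    """
    features = ['uPoS=' + token['upos'],
                'xPoS=' + token['xpos'],
                'rel=' + token['deprel']]
    features += [k + '=' + v for k, v in sorted(token['feats'].items())]
    return (token['lemma'], tuple(features)), token['form']

# The example from the text: plural form of the adjective "artificiale"
token = {'lemma': 'artificiale', 'form': 'artificiali', 'upos': 'ADJ',
         'xpos': 'A', 'deprel': 'amod', 'feats': {'Number': 'Plur'}}
print(to_inflection_instance(token))
# (('artificiale', ('uPoS=ADJ', 'xPoS=A', 'rel=amod', 'Number=Plur')), 'artificiali')
```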
3 Evaluation

In this section, we present an evaluation of the models presented in Section 2, with particular consideration for two crucial points influencing the performance of the DipInfo-UniTo realizer, namely the training data and the learning parameter settings. In Basile and Mazzei (2018), hardware limitations did not allow for extensive experimentation dedicated to optimizing the realizer's performance. In this paper, we aim to bridge this gap by experimenting with higher computing capabilities, specifically a virtualized GNU/Linux box with 16 cores and 64GB of RAM.

3.1 Training Data

For our experiments, we used the four Italian corpora annotated with Universal Dependencies available in the Universal Dependency repositories (http://universaldependencies.org/). In total, they comprise 270,703 tokens and 12,838 sentences. We previously used this corpus to train the DipInfo-UniTo realizer that participated in the SRST18 competition (Basile and Mazzei, 2018). We refer to this corpus as Gold-SRST18 henceforth.

Moreover, we used a larger corpus extracted from ItWaC, a large unannotated corpus of Italian (Baroni et al., 2009). We parsed ItWaC with UDPipe (Straka and Straková, 2017) and selected a random sample of 9,427 sentences (274,115 tokens). We refer to this corpus as Silver-WaC henceforth.

3.2 Word Ordering Performances

We trained the word order prediction module of our system on the Gold-SRST18 corpus as well as on the larger corpus created by concatenating Gold-SRST18 and Silver-WaC. Our implementation of ListNet, featuring a regularization parameter to prevent overfitting, is available at https://github.com/valeriobasile/listnet.

The performance of the ListNet algorithm for word ordering is given in terms of the average Kendall's Tau (Kendall, 1938), denoted τ, a measure of rank correlation used to score each of the rankings predicted by our model for every subtree (Figure 2). τ measures the similarity between two rankings by counting how many pairs of elements are swapped with respect to the original ordering, out of all possible pairs of n elements:

\tau = \frac{\#\text{concordant pairs} - \#\text{discordant pairs}}{\frac{1}{2}\, n(n-1)}

Therefore, τ ranges from -1 to 1.
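For reference, the sketch below is a direct, naive implementation of the τ formula above for two orderings of the same items; it illustrates the metric itself and is not the evaluation script used to produce Figure 2.

```python
from itertools import combinations

def kendall_tau(reference, predicted):
    """Kendall's tau between two orderings of the same items.

    Each argument is a sequence of hashable items; a pair is concordant
    if the two orderings agree on its relative order, discordant otherwise.
    """
    n = len(reference)
    pos_ref = {item: i for i, item in enumerate(reference)}
    pos_pred = {item: i for i, item in enumerate(predicted)}
    concordant = discordant = 0
    for a, b in combinations(reference, 2):
        if (pos_ref[a] - pos_ref[b]) * (pos_pred[a] - pos_pred[b]) > 0:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# A single swapped pair out of three items lowers tau from 1.0 to 0.33
print(kendall_tau(['numeroso', 'suo', 'opera'], ['suo', 'numeroso', 'opera']))
```

A value of 1 means the predicted ordering is identical to the reference ordering, while -1 means it is completely reversed.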
In Figure 2 we report the τ values obtained at various epochs of learning for both the Gold-SRST18 and the Gold-SRST18+Silver-WaC corpora. In particular, in order to investigate the influence of the learning rate parameter (LR) on the learning of the ListNet model, we report the τ trends for LR = 5 · 10^-5 (the value originally used for the official SRST18 submission), LR = 5 · 10^-6 and LR = 5 · 10^-7. It is quite clear that the value of LR has a great impact on word ordering performance, and that LR = 5 · 10^-5 is not appropriate to reach the best performance. This explains the poor performance of the DipInfo-UniTo realizer in the SRST18 competition (Table 1). Indeed, the typical zigzag shape of the curve suggests a sort of loop in the gradient descent algorithm. In contrast, LR = 5 · 10^-6 seems to reach a plateau after the 100th epoch with both corpora used in the experiments. We used the system tuned with this value of the learning rate to evaluate the global performance of the realizer.

Figure 2: The trend of the τ value with respect to the ListNet training epochs, for the Gold-SRST18 training set (LR = 0.00005, 0.000005 and 0.0000005) and for the Gold-SRST18+Silver-WaC training set (LR = 0.000005).

3.3 Morphology Inflection Performances

In order to understand the impact of the Silver-WaC corpus on the global performance of the system, we trained the DNN system for morphology inflection both on the Gold-SRST18 corpus and on the larger corpus composed of Gold-SRST18 + Silver-WaC. An implementation of the model by Aharoni and Goldberg (2017) is freely available at https://github.com/roeeaharoni/morphological-reinflection. In Figure 3 we report the accuracy on the SRST18 development set for both corpora. A first analysis of the trend shows little improvement to the global performance of the realization from the inclusion of additional data (see the discussion in the next section).

Figure 3: The trend of the morphology accuracy on the SRST18 development set with respect to the DNN training epochs, for the Gold-SRST18 and Gold-SRST18+Silver-WaC training sets.

3.4 Global Surface Realization Performances

Finally, we evaluate the end-to-end performance of our systems by combining the output of the two modules and submitting it to the evaluation scorer of the Surface Realization Shared Task. In Table 1 we report the performance of the various tested systems with respect to the BLEU-4, DIST and NIST measures, as defined by Mille et al. (2018). The first line reports the official performance of the DipInfo-UniTo realizer in the SRST18 for Italian. The last line reports the best performances achieved on Italian by the participants to SRST18 (Mille et al., 2018). The other lines report the performance of the DipInfo-UniTo realizer obtained by considering various combinations of the gold and silver corpora.

  ListNet   Morpho   BLEU-4   DIST    NIST
  Gsrst     Gsrst    24.61    36.11   8.25
  G         G        36.40    32.80   9.27
  G         G+S      36.60    32.70   9.30
  G+S       G        36.40    32.80   9.27
  G+S       G+S      36.60    32.70   9.30
  -         -        44.16    58.61   9.11

Table 1: The performances of the systems with respect to the BLEU-4, DIST and NIST measures.

The results show a clear improvement for the word order module (note that the DIST metric is character-based, and is therefore more sensitive to morphological variation than NIST and BLEU-4). In contrast, the performance of the morphology submodule seems to be unaffected by the use of a larger training corpus. This effect could be due to different causes. Errors are present in the silver standard training set, and it is not clear to what extent the morphological analysis is correct with respect to the syntactic analysis. The other possible cause is the neural model itself: indeed, Aharoni and Goldberg (2017) report a plateau in performance even when feeding the model relatively small datasets. The DipInfo-UniTo realizer performs better than the best systems of the SRST18 challenge on one out of three metrics (NIST).

4 Conclusion and Future Work

In this paper, we considered the problem of analysing the impact of the training data and of parameter tuning on the (modular and global) performance of the DipInfo-UniTo realizer. We showed experimentally that the DipInfo-UniTo realizer can give competitive results (i) by augmenting the training data set with automatically annotated sentences, and (ii) by tuning the learning parameters of the neural models.

In future work, we intend to address the main limitation of our approach, that is, the impossibility of realizing non-projective sentences. Moreover, further optimization of both neural models will be carried out on a new high-performance architecture (Aldinucci et al., 2018), by executing a systematic grid search over the hyperparameter space, namely the regularization factor and weight initialization for ListNet, and the specific DNN hyperparameters for the morphology module.

Acknowledgments

We thank the GARR consortium, which kindly allowed us to use the GARR Cloud Platform (https://cloud.garr.it) to run some of the experiments described in this paper. Valerio Basile was partially funded by Progetto di Ateneo/CSP 2016 (Immigrants, Hate and Prejudice in Social Media, S1618 L2 BOSC 01). Alessandro Mazzei was partially supported by the HPC4AI project, funded by the Region Piedmont POR-FESR 2014-20 programme (INFRA-P call).
References

Roee Aharoni and Yoav Goldberg. 2017. Morphological inflection generation with hard monotonic attention. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL 2017), pages 2004–2015.

Rami Al-Rfou', Bryan Perozzi, and Steven Skiena. 2013. Polyglot: Distributed word representations for multilingual NLP. In Proceedings of CoNLL, pages 183–192. ACL.

Marco Aldinucci, Sergio Rabellino, Marco Pironti, Filippo Spiga, Paolo Viviani, Maurizio Drocco, Marco Guerzoni, Guido Boella, Marco Mellia, Paolo Margara, Idillio Drago, Roberto Marturano, Guido Marchetto, Elio Piccolo, Stefano Bagnasco, Stefano Lusso, Sara Vallero, Giuseppe Attardi, Alex Barchiesi, Alberto Colla, and Fulvio Galeazzi. 2018. HPC4AI, an AI-on-demand federated platform endeavour. In ACM Computing Frontiers, Ischia, Italy, May.

Marco Baroni, Silvia Bernardini, Adriano Ferraresi, and Eros Zanchetta. 2009. The WaCky wide web: A collection of very large linguistically processed web-crawled corpora. Language Resources and Evaluation, 43(3):209–226, September.

Valerio Basile and Alessandro Mazzei. 2018. The DipInfo-UniTo system for SRST 2018. In Proceedings of the First Workshop on Multilingual Surface Realisation, pages 65–71. Association for Computational Linguistics.

Valerio Basile. 2015. From Logic to Language: Natural Language Generation from Logical Forms. Ph.D. thesis, University of Groningen, Netherlands.

Bernd Bohnet, Anders Björkelund, Jonas Kuhn, Wolfgang Seeker, and Sina Zarrieß. 2012. Generating non-projective word order in statistical linearization. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 928–939. Association for Computational Linguistics.

Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li. 2007. Learning to rank: From pairwise approach to listwise approach. In Proceedings of the 24th International Conference on Machine Learning (ICML '07), pages 129–136, New York, NY, USA. ACM.

M. G. Kendall. 1938. A new measure of rank correlation. Biometrika, 30(1/2):81–93.

Simon Mille, Anja Belz, Bernd Bohnet, Yvette Graham, Emily Pitler, and Leo Wanner. 2018. The first multilingual surface realisation shared task (SR'18): Overview and evaluation results. In Proceedings of the First Workshop on Multilingual Surface Realisation, pages 1–12. Association for Computational Linguistics.

Ehud Reiter and Robert Dale. 2000. Building Natural Language Generation Systems. Cambridge University Press, New York, NY, USA.

Milan Straka and Jana Straková. 2017. Tokenizing, POS tagging, lemmatizing and parsing UD 2.0 with UDPipe. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 88–99, Vancouver, Canada, August. Association for Computational Linguistics.