Parsing Italian texts together is better than parsing them alone!

Oronzo Antonelli
DISI, University of Bologna, Italy
antonelli.oronzo@gmail.it

Fabio Tamburini
FICLIT, University of Bologna, Italy
fabio.tamburini@unibo.it

Abstract

English. In this paper we present a work aimed at testing the most advanced, state-of-the-art syntactic parsers based on deep neural networks (DNN) on Italian. We carried out a set of experiments using the Universal Dependencies benchmarks and we propose a new solution based on ensemble systems that obtains very good performances.

Italiano. In this paper we present some experiments aimed at verifying the performances of the most advanced syntactic parsers on Italian, using the treebanks available within the Universal Dependencies project. We also propose a new system based on ensemble parsing which has shown very good performances.

1 Introduction

Syntactic parsing of morphologically rich languages like Italian often poses a number of hard challenges. Various works have applied different kinds of freely available parsers to Italian, training them on different resources and comparing their results with different methods (Lavelli, 2014; Alicante et al., 2015; Lavelli, 2016), in order to gather a clear picture of syntactic parsing performances for the Italian language. In this direction it seems relevant to cite the periodic EVALITA[1] campaigns for the evaluation of constituency and dependency parsers devoted to the syntactic analysis of Italian (Bosco and Mazzei, 2011; Bosco et al., 2014). Other studies on the syntactic parsing of Italian tried to enhance parsing performances by building some kind of ensemble system (Lavelli, 2013; Mazzei, 2015).

Looking at the cited papers, we can observe that they evaluated the state-of-the-art parsers available before the "neural net revolution", not including the latest improvements proposed by new research studies.

The goal of this paper is twofold: first, we would like to test the effectiveness of parsers based on the newly-proposed technologies, mainly deep neural networks, on Italian; second, we would like to propose an ensemble system able to further improve the performances of the neural parsers when parsing Italian texts.

[1] http://www.evalita.it

2 The Neural Parsers

We considered nine state-of-the-art parsers representing a wide range of contemporary approaches to dependency parsing, whose architectures are all based on neural network models (see Table 1). We set up each parser using the data from the Italian Universal Dependencies (Nivre et al., 2016) treebanks, UD Italian 2.1 (general texts) and UD Italian PoSTWITA 2.2 (tweets). For all parsers we used the default settings for training, following the recommendations of the developers.

In Chen and Manning (2014) dense features are used to learn representations of words, tags and labels with a neural network classifier, in order to take parsing decisions within a transition-based greedy model. To address some of its limitations, Andor et al. (2016) augmented the parser model with beam search and a conditional random field loss objective. The work of Ballesteros et al. (2015) extends the parser defined in Dyer et al. (2015), introducing character-level representations of words based on bidirectional LSTMs to improve the performance of the stack-LSTM model, which learns representations of the parser state.

In Kiperwasser and Goldberg (2016) the bidirectional-LSTM recurrent output vector of each word is concatenated with the recurrent vector of every possible head, and the result is used as input to a multi-layer perceptron (MLP) network that scores each resulting edge. Cheng et al. (2016) propose a bidirectional attention model which uses two additional unidirectional RNNs, called the left-right and right-left query components. Building on the Kiperwasser and Goldberg (2016) and Cheng et al. (2016) models, Dozat and Manning (2017) use a biaffine attention mechanism instead of the traditional MLP-based attention. The model proposed in Nguyen et al. (2017) jointly learns POS tagging and graph-based dependency parsing: it uses a bidirectional LSTM for POS tagging and the Kiperwasser and Goldberg (2016) approach for dependency parsing. Shi et al. (2017a,b) describe a parser that combines three parsing paradigms using a dynamic programming approach.
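As a compact reference for the arc scorer just mentioned, the biaffine attention of Dozat and Manning (2017) can be sketched as follows (a simplified rendering of their published formulation, with our notation: r_i is the BiLSTM output for word i):

```latex
% Biaffine arc scoring (sketch of Dozat and Manning, 2017):
% two MLPs produce dependent- and head-specific views of the BiLSTM
% outputs; s_{ij} scores word j as the syntactic head of word i.
\[
  h_i^{(dep)} = \mathrm{MLP}^{(dep)}(r_i), \qquad
  h_j^{(head)} = \mathrm{MLP}^{(head)}(r_j)
\]
\[
  s_{ij}^{(arc)} = {h_j^{(head)}}^{\top} U^{(1)}\, h_i^{(dep)}
                 + {u^{(2)}}^{\top} h_j^{(head)}
\]
```

The bilinear term models the compatibility of a specific head-dependent pair, while the linear term captures the prior plausibility of word j acting as a head at all.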
Parser Ref. - Abbreviation                   Method                        Parsing
(Chen and Manning, 2014) - CM14              Tb: a-s                       Greedy
(Ballesteros et al., 2015) - BA15            Tb: a-s                       Greedy
(Kiperwasser and Goldberg, 2016) - KG16:T    Tb: a-h                       Greedy
(Kiperwasser and Goldberg, 2016) - KG16:G    Gb: a-f                       Eisner
(Andor et al., 2016) - AN16                  Tb: a-s                       Beam-S
(Cheng et al., 2016) - CH16                  Gb: a-f                       cle
(Dozat and Manning, 2017) - DM17             Gb: a-f                       cle
(Shi et al., 2017a,b) - SH17                 Tb: a-h/a-eager or Gb: a-f    Greedy / Eisner
(Nguyen et al., 2017) - NG17                 Gb: a-f                       Eisner

Table 1: All the neural parsers considered in this study with their fundamental features, as well as the abbreviations used throughout the paper. In this table "Tb/Gb" means "Transition/Graph-based", "Beam-S" means "Beam-search" and "a-s/h/f" means "arc-standard/hybrid/factored".

We trained, validated and tested the nine considered parsers, as well as all the proposed extensions, by considering three different setups:

• setup0: only the UD Italian 2.1 dataset;

• setup1: only the UD Italian PoSTWITA 2.2 dataset;

• setup2: the UD Italian 2.1 dataset joined with the UD Italian PoSTWITA 2.2 dataset (train and validation sets), keeping the test set of PoSTWITA 2.2.

After the influential paper by Reimers and Gurevych (2017) it is clear to the community that reporting a single score for each DNN training session could be heavily affected by the system initialisation point; we should instead report the mean and standard deviation of various runs with the same settings, in order to get a more accurate picture of the real systems' performances and make more reliable comparisons between them (see the snippet below).
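As a minimal illustration of this reporting protocol (the scores below are hypothetical, not taken from our experiments):

```python
import statistics

# LAS of five training runs with identical settings, differing only in
# the random initialisation (hypothetical values for illustration).
las_runs = [91.37, 91.12, 91.58, 91.44, 91.29]

mean = statistics.mean(las_runs)
stdev = statistics.stdev(las_runs)  # sample standard deviation over the runs

# Reported in the "mean/standard deviation" format used in Table 2.
print(f"LAS: {mean:.2f}/{stdev:.2f}")
```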
Table 2 shows the parsers' performances on the test sets for the three setups described above, executing the training/validation/test cycle 5 times. In every setup the DM17 parser exhibits the best performances, which are notably high for general Italian. As one could expect, the performances in setup1 were much lower than those in setup0, due to the intrinsic difficulty of parsing tweets and to the scarcity of annotated tweets for training. Joining the two datasets in setup2 allowed a relevant gain in parsing tweets, even though we added out-of-domain data. For these reasons, in all the following experiments we abandoned setup1, since it seemed more relevant to use the joined data (setup2) and compare the results to setup0.

setup0           Valid. Ita                   Test Ita
          UAS          LAS           UAS          LAS
CM14      88.20/0.18   85.46/0.14    89.33/0.17   86.85/0.22
BA15      91.15/0.11   88.55/0.23    91.57/0.38   89.15/0.33
KG16:T    91.17/0.29   88.42/0.24    91.21/0.33   88.72/0.24
KG16:G    91.85/0.27   89.23/0.31    92.04/0.18   89.65/0.10
AN16      85.52/0.34   77.67/0.30    87.70/0.31   79.48/0.24
CH16      92.42/0.00   89.60/0.00    92.82/0.00   90.26/0.00
DM17      93.37/0.27   91.37/0.24    93.72/0.14   91.84/0.18
SH17      89.67/0.24   85.05/0.24    89.89/0.29   84.55/0.30
NG17      90.37/0.12   87.19/0.21    90.67/0.15   87.58/0.11

setup1           Valid. PoSTW                 Test PoSTW
          UAS          LAS           UAS          LAS
CM14      81.03/0.17   75.24/0.30    81.50/0.28   76.07/0.17
BA15      83.44/0.20   77.70/0.25    84.06/0.38   78.64/0.44
KG16:T    77.38/0.14   68.81/0.25    77.41/0.43   69.13/0.43
KG16:G    78.81/0.23   70.14/0.33    78.78/0.44   70.52/0.51
AN16      77.74/0.25   66.63/0.16    77.78/0.33   67.21/0.30
CH16      84.78/0.00   78.51/0.00    86.12/0.00   79.89/0.00
DM17      85.01/0.16   78.80/0.09    86.26/0.16   80.40/0.19
SH17      80.52/0.18   73.71/0.14    81.11/0.29   74.53/0.26
NG17      82.02/0.11   75.20/0.24    82.74/0.39   76.22/0.41

setup2           Valid. Ita+PoSTW             Test PoSTW
          UAS          LAS           UAS          LAS
CM14      85.52/0.13   81.51/0.05    82.62/0.24   77.45/0.23
BA15      87.85/0.13   83.80/0.12    85.15/0.29   80.12/0.27
KG16:T    83.89/0.23   77.77/0.26    80.47/0.36   72.92/0.46
KG16:G    84.70/0.14   78.41/0.14    81.41/0.37   73.49/0.19
AN16      82.95/0.33   73.46/0.37    79.81/0.27   69.19/0.19
CH16      89.16/0.00   84.56/0.00    86.85/0.00   80.93/0.00
DM17      89.72/0.10   85.85/0.13    87.22/0.24   81.65/0.21
SH17      85.85/0.36   80.00/0.39    83.12/0.50   76.38/0.38
NG17      86.81/0.04   82.13/0.09    84.09/0.07   78.02/0.11

Table 2: Mean/standard deviation of UAS/LAS for each parser and for the different setups, repeating the experiments 5 times. All the results are statistically significant (p < 0.05) and the best values are shown in boldface.

3 An Ensemble of Neural Parsers

The DEPENDABLE tool of Choi et al. (2015) reports ensemble upper-bound performances, assuming that, given the parsers' outputs, the best tree can be identified by an oracle "MACRO" (MA), or that the best arc can be identified by another oracle "MICRO" (mi). Table 3 shows that, by applying these oracles, we have plenty of room for improving the performances by building some kind of ensemble system able to cleverly choose the correct information from the different parsers' outputs and combine it into a better final solution. This observation motivates our proposal.

             Validation              Test
          UAS       LAS         UAS       LAS
setup0
mi        98.30%    97.82%      98.08%    97.72%
MA        96.62%    95.10%      96.31%    94.82%
setup2
mi        97.08%    96.02%      96.32%    94.73%
MA        94.62%    91.29%      93.27%    88.50%

Table 3: Results obtained by building an ensemble system based on the oracles mi and MA, considering all parsers.
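To make the arc-level oracle concrete, its upper bound can be computed roughly as follows (our illustrative sketch, not the DEPENDABLE implementation; the input format, lists of predicted head indices per token, is an assumption):

```python
def micro_oracle_uas(parser_outputs, gold_heads):
    """Upper-bound UAS: an arc counts as correct if ANY parser got it right.

    parser_outputs: one prediction per parser; each prediction is a list of
                    head indices, one per token (hypothetical format).
    gold_heads:     gold head index for each token.
    """
    correct = sum(
        1
        for tok, gold in enumerate(gold_heads)
        if any(pred[tok] == gold for pred in parser_outputs)
    )
    return 100.0 * correct / len(gold_heads)

# Toy example with three parsers over a five-token sentence:
outputs = [[0, 1, 1, 2, 3], [0, 3, 1, 2, 2], [2, 1, 4, 2, 3]]
print(micro_oracle_uas(outputs, [0, 1, 1, 2, 2]))  # 100.0: each gold arc is found by some parser
```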
To combine the parser outputs we used the following ensemble schemas:

• Voting: each parser contributes by assigning a vote to every dependency edge, as described in Zeman and Žabokrtský (2005). With the majority approach the resulting dependency tree could be ill-formed; in this case, using the switching approach, the tree is replaced with the output of the first parser (see the sketch after this list).

• Reparsing: as described in Sagae and Lavie (2006) and Hall et al. (2007), an MST algorithm is used to reparse a graph in which each word of the sentence is a node. The MST algorithms used are Chu-Liu/Edmonds (cle) and Eisner, as reported in McDonald et al. (2005). Three weighting strategies are used for Chu-Liu/Edmonds: equally weighted (w2); weighted according to the total labeled accuracy on the validation set (w3); weighted according to the labeled accuracy per coarse-grained PoS tag on the validation set (w4).

• Distilling: in Kuncoro et al. (2016) the authors train a distillation parser using a loss objective with a cost that incorporates ensemble uncertainty estimates for each possible attachment.
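A minimal, unlabeled sketch of the voting schema (our own illustration under assumed data structures, not the implementation used in the experiments):

```python
from collections import Counter

def vote_heads(predictions):
    """Majority vote over edges. predictions holds one head sequence per
    parser (assumed format: heads[i] is the head of token i+1, with 0 as
    the artificial root); ties resolve to the head counted first, i.e. to
    the first parser listed, via Counter's insertion-order stability."""
    n = len(predictions[0])
    return [Counter(pred[tok] for pred in predictions).most_common(1)[0][0]
            for tok in range(n)]

def is_tree(heads):
    """True when every token reaches the root (head 0) without cycles,
    i.e. the voted head sequence forms a well-formed dependency tree."""
    for tok in range(1, len(heads) + 1):
        seen, node = set(), tok
        while node != 0:
            if node in seen or not 1 <= node <= len(heads):
                return False  # cycle or dangling head: ill-formed output
            seen.add(node)
            node = heads[node - 1]
    return True

def combine(predictions):
    """Majority voting with the switching fallback: if voting produces an
    ill-formed tree, return the tree of the first (most trusted) parser."""
    voted = vote_heads(predictions)
    return voted if is_tree(voted) else predictions[0]
```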
4 Results

Tables 4, 7 and 9 show the performances of the ensembles built on the best validation-set results obtained in the 5 training/test cycles, considering both setup0 and setup2. Table 6 reports the number of malformed trees for the majority strategy. Tables 5 and 8 report the number of cases in which the ensemble output differs from the baseline, for both labeled (L) and unlabeled (U) outputs. On average, the percentage of differing unlabeled outputs varies from 2% to 15% with respect to the baseline. For the best result (DM17+ALL) the difference on setup0 and setup2 is about 4%.

setup0
                           Validation          Test
Voters/Strategy            UAS      LAS      UAS      LAS
DM17+CH16+BA15/maj.      94.20%   92.27%   93.77%   92.13%
DM17+CH16+BA15/swi.      94.11%   92.16%   93.79%   92.14%
AN16+CM14+SH17/maj.      90.43%   87.96%   91.03%   88.47%
AN16+CM14+SH17/swi.      89.44%   86.77%   90.17%   87.43%
DM17+CM14+SH17/maj.      93.84%   92.03%   93.82%   92.27%
DM17+CM14+SH17/swi.      93.76%   91.94%   93.82%   92.25%
AN16+ALL/maj.            94.37%   92.65%   93.83%   92.27%
AN16+ALL/swi.            93.99%   92.15%   93.43%   91.73%
DM17+ALL/maj.            94.42%   92.67%   93.94%   92.41%
DM17+ALL/swi.            94.38%   92.60%   93.91%   92.37%
DM17 (baseline)          93.74%   91.66%   93.75%   92.03%

setup2
                           Validation          Test
Voters/Strategy            UAS      LAS      UAS      LAS
DM17+CH16+BA15/maj.      90.57%   87.16%   88.21%   83.64%
DM17+CH16+BA15/swi.      90.51%   87.10%   88.13%   83.51%
AN16+CM14+SH17/maj.      86.90%   83.60%   84.09%   79.78%
AN16+CM14+SH17/swi.      86.01%   82.50%   82.58%   77.94%
DM17+CM14+SH17/maj.      90.35%   87.21%   88.07%   83.64%
DM17+CM14+SH17/swi.      90.27%   87.11%   87.99%   83.52%
AN16+ALL/maj.            90.30%   87.26%   88.36%   84.13%
AN16+ALL/swi.            89.70%   86.45%   87.46%   83.06%
DM17+ALL/maj.            90.64%   87.60%   88.51%   84.42%
DM17+ALL/swi.            90.65%   87.62%   88.50%   84.20%
DM17 (baseline)          89.82%   85.96%   87.59%   81.95%

Table 4: Results of ensembles using the switching and majority approaches on the best models in setup0 and setup2. The baseline is defined by the best results of Dozat and Manning (2017).

setup0
                          Validation      Test
                           /11,908       /10,417
Voters/Strategy             U      L      U      L
DM17+CH16+BA15/maj.       208     61    188     46
DM17+CH16+BA15/swi.       192     52    175     39
AN16+CM14+SH17/maj.     1,006    424    783    336
AN16+CM14+SH17/swi.     1,130    489    870    371
DM17+CM14+SH17/maj.       170     37    139     15
DM17+CM14+SH17/swi.       157     33    129     13
AN16+ALL/maj.             382    126    328    105
AN16+ALL/swi.             460    164    386    133
DM17+ALL/maj.             356    117    282     81
DM17+ALL/swi.             312     97    255     72

setup2
                          Validation      Test
                           /24,243       /12,668
Voters/Strategy             U      L      U      L
DM17+CH16+BA15/maj.       597    219    470    213
DM17+CH16+BA15/swi.       521    185    394    172
AN16+CM14+SH17/maj.     2,757  1,329  1,805    941
AN16+CM14+SH17/swi.     2,976  1,429  1,986  1,033
DM17+CM14+SH17/maj.       490    140    337     93
DM17+CM14+SH17/swi.       453    121    300     73
AN16+ALL/maj.           1,377    624    897    440
AN16+ALL/swi.           1,610    741  1,063    534
DM17+ALL/maj.           1,156    502    784    378
DM17+ALL/swi.             920    374    614    280

Table 5: Number of cases in which the output of the ensemble systems, using switching and majority, differs from the baseline Dozat and Manning (2017).

                        setup0           setup2
Voters                Valid.   Test    Valid.   Test
                       /564    /482    /1,235   /674
DM17+CH16+BA15            9       7        31     31
AN16+CM14+SH17           45      25        88     77
DM17+CM14+SH17            6       6        19     23
AN16+ALL                 18      17        73     63
DM17+ALL                 17      11        75     57

Table 6: Number of malformed trees obtained by using the majority strategy in both setups.

The results of the voting approach reported in Table 4 show that the majority strategy is slightly better than the switching strategy, although it must be taken into account that the former may produce ill-formed dependency trees. The percentage of ill-formed trees on the validation/test sets varies from a minimum of 2% to a maximum of 8%. For this reason the majority strategy should be used only when it is followed by a manual correction phase. The switching strategy performs well if the first voter is one of the best parsers: in fact, the combinations AN16+ALL and AN16+CM14+SH17 perform worse than their counterparts using the best parser (DM17) as the first voter. Overall, the highest performance is achieved using all parsers together with DM17 as the first voter. For setup0 the increases are +0.19% in UAS and +0.38% in LAS, while in setup2 they are +0.92% in UAS and +2.47% in LAS, with respect to the best single parser (again DM17).

The results of the reparsing approach reported in Table 7 show that the Chu-Liu/Edmonds algorithm is slightly better than the Eisner algorithm; a sketch of this schema is given below. In this case, the choice of which strategy to use must take into account whether or not we want to allow non-projectivity. The percentage of non-projective dependency trees produced by Chu-Liu/Edmonds on the validation/test sets varies from a minimum of 7% to a maximum of 12%, compared with an average of 4% for the Italian corpora. Overall, the highest performances are achieved using the Chu-Liu/Edmonds algorithm: for setup0 the increases are +0.25% in UAS and +0.45% in LAS, while in setup2 they are +0.77% in UAS and +2.30% in LAS, with respect to the best single parser (DM17).
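For concreteness, the weighted reparsing schema can be sketched as follows (our illustration, not the experimental code; it assumes head sequences with 0 as the artificial root and relies on the Chu-Liu/Edmonds implementation available in networkx):

```python
import networkx as nx

def reparse(predictions, weights):
    """Weighted reparsing: each parser adds its weight to every arc it
    proposes (head -> dependent, heads 1-based with 0 as the root), then
    the maximum spanning arborescence is extracted (Chu-Liu/Edmonds)."""
    G = nx.DiGraph()
    for heads, w in zip(predictions, weights):
        for dep, head in enumerate(heads, start=1):
            # Accumulate the parser weights on each proposed arc.
            prev = G.get_edge_data(head, dep, default={"weight": 0.0})["weight"]
            G.add_edge(head, dep, weight=prev + w)
    msa = nx.maximum_spanning_arborescence(G)  # rooted at node 0
    return [next(iter(msa.predecessors(dep)))
            for dep in range(1, len(predictions[0]) + 1)]

# w2 (equal weights) over three hypothetical parser outputs:
preds = [[0, 1, 2], [0, 1, 1], [2, 0, 1]]
print(reparse(preds, [1.0, 1.0, 1.0]))  # -> [0, 1, 1]
```

The w3 and w4 variants differ only in the weight vector: per-parser validation LAS for w3, and a per-token weight indexed by the dependent's coarse-grained PoS tag for w4.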
setup0
                             Validation          Test
Voters/Strategy              UAS      LAS      UAS      LAS
DM17+CH16+BA15/cle-w2      93.82%   91.85%   93.54%   91.83%
DM17+CH16+BA15/cle-w3      93.89%   91.82%   93.78%   92.06%
DM17+CH16+BA15/cle-w4      94.20%   92.28%   93.72%   92.04%
DM17+CH16+BA15/eisner      94.05%   92.05%   93.46%   91.78%
ALL/cle-w2                 94.31%   92.53%   93.85%   92.23%
ALL/cle-w3                 94.16%   92.41%   94.00%   92.48%
ALL/cle-w4                 94.29%   92.58%   93.95%   92.38%
ALL/eisner                 94.31%   92.53%   93.95%   92.35%
DM17 (baseline)            93.74%   91.66%   93.75%   92.03%

setup2
                             Validation          Test
Voters/Strategy              UAS      LAS      UAS      LAS
DM17+CH16+BA15/cle-w2      90.33%   86.95%   87.69%   83.31%
DM17+CH16+BA15/cle-w3      89.82%   85.96%   87.59%   81.95%
DM17+CH16+BA15/cle-w4      90.41%   86.99%   87.94%   83.32%
DM17+CH16+BA15/eisner      90.50%   87.05%   88.04%   83.51%
ALL/cle-w2                 90.52%   87.53%   88.36%   84.25%
ALL/cle-w3                 89.90%   86.75%   87.79%   83.54%
ALL/cle-w4                 90.42%   87.46%   88.19%   84.11%
ALL/eisner                 90.45%   87.41%   88.31%   84.08%
DM17 (baseline)            89.82%   85.96%   87.59%   81.95%

Table 7: Results of ensembles using the reparsing approaches on the best models in setup0 and setup2. The baseline is again defined by the best results of DM17.

setup0
                            Validation      Test
                             /11,908       /10,417
Voters/Strategy               U      L      U      L
DM17+CH16+BA15/cle-w2       360    129    307     90
DM17+CH16+BA15/cle-w3        96      0     89      1
DM17+CH16+BA15/cle-w4       267     76    247     52
DM17+CH16+BA15/eisner       375    130    327    103
ALL/cle-w2                  400    131    333    103
ALL/cle-w3                  351    108    299     79
ALL/cle-w4                  383    126    307     87
ALL/eisner                  411    133    333    106

setup2
                            Validation      Test
                             /24,243       /12,668
Voters/Strategy               U      L      U      L
DM17+CH16+BA15/cle-w2     1,056    496    800    424
DM17+CH16+BA15/cle-w3         0      0      0      0
DM17+CH16+BA15/cle-w4       603    264    491    236
DM17+CH16+BA15/eisner     1,047    443    789    376
ALL/cle-w2                1,347    599    882    417
ALL/cle-w3                1,261    537    804    363
ALL/cle-w4                1,274    576    822    389
ALL/eisner                1,367    607    916    436

Table 8: Number of cases in which the output of the ensemble systems, using the reparsing approaches, differs from the baseline Dozat and Manning (2017).

The results of the distilling strategy, reported in Table 9, show worse outcomes than the previous approaches, scoring below the baseline.

Setup       UAS                   LAS
setup0      92.50% (–1.25%)       89.93% (–2.10%)
setup2      86.73% (–0.86%)       81.39% (–0.56%)

Table 9: Results of the distilling approach on the best models in setup0 and setup2. In brackets we report the differences between the distilled models and the best results of DM17, as baseline.
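For reference, the "ensemble uncertainty" cost used by the distilling schema can be read, roughly, as follows (our paraphrase of Kuncoro et al. (2016) with assumed notation, not their exact objective):

```latex
% Sketch: q(h, d) is the fraction of the M ensemble parsers assigning
% head h to dependent d; attachments the ensemble rarely proposes
% receive a high cost in the distillation loss.
\[
  \mathrm{cost}(h, d) = 1 - q(h, d), \qquad
  q(h, d) = \frac{1}{M} \sum_{m=1}^{M} \mathbb{1}\big[ y_m(d) = h \big]
\]
```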
Thanks to the number of parser models adopted in the experiments, it has been possible to verify that the performances of the ensemble models increase as the number of parsers grows.

The improvement in LAS is, in most cases, at least twice that in UAS. This could mean that the ensemble models capture the type of dependency relation with better precision than the head-dependent attachment itself.

All the proposed ensemble strategies, except for distilling, perform more or less in the same way; the choice of which strategy to use therefore depends, in part, on the properties that we want the combined dependency tree to have.

5 Discussion and Conclusions

We have studied the performances of some neural dependency parsers on the generic and social media domains. Using the predictions of each single parser, we combined the best outcomes in various ways to improve the overall performance. The ensemble models are more effective on corpora built using in-domain data (social media), giving an improvement of ∼1% in UAS and ∼2.5% in LAS.

Our work is inspired by the work of Mazzei (2015). Differently from his work, we use a larger set of state-of-the-art parsers, all based on neural networks, in order to gain more diversity among the models used in the ensembles; furthermore, we experimented with the distilling strategy and the Eisner reparsing algorithm. Moreover, we built ensembles on larger datasets using both generic and social media texts.

Acknowledgements

We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research.

References

Anita Alicante, Cristina Bosco, Anna Corazza, and Alberto Lavelli. 2015. Evaluating Italian parsing across syntactic formalisms and annotation schemes. In Roberto Basili, Cristina Bosco, Rodolfo Delmonte, Alessandro Moschitti, and Maria Simi, editors, Harmonization and Development of Resources and Tools for Italian Natural Language Processing within the PARLI Project, Springer International Publishing, Cham, pages 135–159.

Daniel Andor, Chris Alberti, David Weiss, Aliaksei Severyn, Alessandro Presta, Kuzman Ganchev, Slav Petrov, and Michael Collins. 2016. Globally normalized transition-based neural networks. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). ACL, Berlin, Germany, pages 2442–2452.

Miguel Ballesteros, Chris Dyer, and Noah A. Smith. 2015. Improved transition-based parsing by modeling characters instead of words with LSTMs. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. ACL, Lisbon, Portugal, pages 349–359.

Cristina Bosco, Felice Dell'Orletta, Simonetta Montemagni, Manuela Sanguinetti, and Maria Simi. 2014. The EVALITA 2014 dependency parsing task. In Proceedings of the Fourth International Workshop EVALITA 2014. Pisa, Italy, pages 1–8.

Cristina Bosco and Alessandro Mazzei. 2011. The EVALITA 2011 parsing task. In Working Notes of EVALITA 2011, CELCT, Povo, Trento.

Danqi Chen and Christopher Manning. 2014. A fast and accurate dependency parser using neural networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). ACL, Doha, Qatar, pages 740–750.

Hao Cheng, Hao Fang, Xiaodong He, Jianfeng Gao, and Li Deng. 2016. Bi-directional attention with agreement for dependency parsing. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. ACL, Austin, Texas, pages 2204–2214.

Jinho D. Choi, Joel Tetreault, and Amanda Stent. 2015. It depends: Dependency parser comparison using a web-based evaluation tool. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). ACL, Beijing, China, pages 387–396.

Timothy Dozat and Christopher D. Manning. 2017. Deep biaffine attention for neural dependency parsing. In Proceedings of the 2017 International Conference on Learning Representations.

Chris Dyer, Miguel Ballesteros, Wang Ling, Austin Matthews, and Noah A. Smith. 2015. Transition-based dependency parsing with stack long short-term memory. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). ACL, Beijing, China, pages 334–343.

Johan Hall, Jens Nilsson, Joakim Nivre, Gülsen Eryigit, Beáta Megyesi, Mattias Nilsson, and Markus Saers. 2007. Single malt or blended? A study in multilingual parser optimization. In Proceedings of the CoNLL Shared Task Session of EMNLP-CoNLL 2007. ACL, Prague, Czech Republic, pages 933–939.

Eliyahu Kiperwasser and Yoav Goldberg. 2016. Simple and accurate dependency parsing using bidirectional LSTM feature representations. Transactions of the Association for Computational Linguistics 4:313–327.

Adhiguna Kuncoro, Miguel Ballesteros, Lingpeng Kong, Chris Dyer, and Noah A. Smith. 2016. Distilling an ensemble of greedy dependency parsers into one MST parser. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. ACL, Austin, Texas, pages 1744–1753.
Alberto Lavelli. 2013. An ensemble model for the EVALITA 2011 dependency parsing task. In Bernardo Magnini, Francesco Cutugno, Mauro Falcone, and Emanuele Pianta, editors, Evaluation of Natural Language and Speech Tools for Italian. Springer Berlin Heidelberg, Berlin, Heidelberg, pages 30–36.

Alberto Lavelli. 2014. Comparing state-of-the-art dependency parsers for the EVALITA 2014 dependency parsing task. In Proceedings of the Fourth International Workshop EVALITA 2014. Pisa, Italy, pages 15–20.

Alberto Lavelli. 2016. Comparing state-of-the-art dependency parsers on the Italian Stanford Dependency Treebank. In Proceedings of the Third Italian Conference on Computational Linguistics (CLiC-it 2016). Napoli, Italy, pages 173–178.

Alessandro Mazzei. 2015. Simple voting algorithms for Italian parsing. In Roberto Basili, Cristina Bosco, Rodolfo Delmonte, Alessandro Moschitti, and Maria Simi, editors, Harmonization and Development of Resources and Tools for Italian Natural Language Processing within the PARLI Project, Springer International Publishing, Cham, pages 161–171.

Ryan McDonald, Fernando Pereira, Kiril Ribarov, and Jan Hajic. 2005. Non-projective dependency parsing using spanning tree algorithms. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing. ACL, Vancouver, British Columbia, Canada, pages 523–530.

Dat Quoc Nguyen, Mark Dras, and Mark Johnson. 2017. A novel neural network model for joint POS tagging and graph-based dependency parsing. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies. ACL, Vancouver, Canada, pages 134–142.

Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajic, Christopher D. Manning, Ryan McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, Reut Tsarfaty, and Daniel Zeman. 2016. Universal Dependencies v1: A multilingual treebank collection. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016).

Nils Reimers and Iryna Gurevych. 2017. Reporting score distributions makes a difference: Performance study of LSTM-networks for sequence tagging. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. ACL, Copenhagen, Denmark, pages 338–348.

Kenji Sagae and Alon Lavie. 2006. Parser combination by reparsing. In Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers. ACL, Stroudsburg, PA, USA, NAACL-Short '06, pages 129–132.

Tianze Shi, Liang Huang, and Lillian Lee. 2017a. Fast(er) exact decoding and global training for transition-based dependency parsing via a minimal feature set. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. ACL, Copenhagen, Denmark, pages 12–23.

Tianze Shi, Felix G. Wu, Xilun Chen, and Yao Cheng. 2017b. Combining global models for parsing universal dependencies. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies. ACL, Vancouver, Canada, pages 31–39.

Daniel Zeman and Zdeněk Žabokrtský. 2005. Improving parsing accuracy by combining diverse dependency parsers. In Proceedings of the Ninth International Workshop on Parsing Technology. ACL, Vancouver, British Columbia, pages 171–178.