Parsing Italian texts together is better than parsing them alone!

Oronzo Antonelli
DISI, University of Bologna, Italy
antonelli.oronzo@gmail.it

Fabio Tamburini
FICLIT, University of Bologna, Italy
fabio.tamburini@unibo.it

Abstract

English. In this paper we present a work aimed at testing the most advanced, state-of-the-art syntactic parsers based on deep neural networks (DNN) on Italian. We carried out a set of experiments using the Universal Dependencies benchmarks and we propose a new solution based on ensemble systems that obtains very good performances.

Italiano. In this paper we present some experiments aimed at verifying the performances of the most advanced syntactic parsers on Italian, using the treebanks available within the Universal Dependencies project. We also propose a new system based on ensemble parsing which has shown very good performances.

1 Introduction

Syntactic parsing of morphologically rich languages like Italian often poses a number of hard challenges. Various works have applied different kinds of freely available parsers to Italian, training them on different resources and comparing their results with different methods (Lavelli, 2014; Alicante et al., 2015; Lavelli, 2016), in order to gather a clear picture of syntactic parsing performances for the Italian language. In this direction it seems relevant to cite the periodic EVALITA[1] campaigns for the evaluation of constituency and dependency parsers devoted to the syntactic analysis of Italian (Bosco and Mazzei, 2011; Bosco et al., 2014). Other studies on the syntactic parsing of Italian tried to enhance parsing performances by building some kind of ensemble system (Lavelli, 2013; Mazzei, 2015).

Looking at the cited papers, we can observe that they evaluated the state-of-the-art parsers available before the "neural net revolution", not including the latest improvements proposed by new research studies.

The goal of this paper is twofold: first, we would like to test the effectiveness of parsers based on the newly-proposed technologies, mainly deep neural networks, on Italian; second, we would like to propose an ensemble system able to further improve the performances of the neural parsers when parsing Italian texts.

[1] http://www.evalita.it

2 The Neural Parsers

We considered nine state-of-the-art parsers representing a wide range of contemporary approaches to dependency parsing, whose architectures are all based on neural network models (see Table 1). We set up each parser using the data from the Italian Universal Dependencies (Nivre et al., 2016) treebanks, UD Italian 2.1 (general texts) and UD Italian PoSTWITA 2.2 (tweets). For all parsers we used the default settings for training, following the recommendations of the developers.

In Chen and Manning (2014) dense features are used to learn representations of words, tags and labels with a neural network classifier, in order to take parsing decisions within a transition-based greedy model. To address some of its limitations, Andor et al. (2016) augmented the parser model with beam search and a conditional random field loss objective. The work of Ballesteros et al. (2015) extends the parser defined in Dyer et al. (2015), introducing character-level representations of words based on bidirectional LSTMs to improve the performance of the stack-LSTM model, which learns representations of the parser state.

In Kiperwasser and Goldberg (2016) the bidirectional-LSTM recurrent output vector of each word is concatenated with the recurrent vector of every possible head, and the result is used as input to a multi-layer perceptron (MLP) network that scores each resulting edge. Cheng et al. (2016) propose a bidirectional attention model which uses two additional unidirectional RNNs, called the left-right and right-left query components. Building on the Kiperwasser and Goldberg (2016) and Cheng et al. (2016) models, Dozat and Manning (2017) use a biaffine attention mechanism instead of the traditional MLP-based attention. The model proposed in Nguyen et al. (2017) jointly learns POS tagging and graph-based dependency parsing: it uses a bidirectional LSTM for POS tagging and the Kiperwasser and Goldberg (2016) approach for dependency parsing. Shi et al. (2017a,b) describe a parser that combines three parsing paradigms using a dynamic programming approach.
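As a compact reference for the arc scorer just mentioned, the biaffine attention of Dozat and Manning (2017) can be sketched as follows (a simplified rendering of their published formulation, with our notation: r_i is the BiLSTM output for word i):

```latex
% Biaffine arc scoring (sketch of Dozat and Manning, 2017):
% two MLPs produce dependent- and head-specific views of the BiLSTM
% outputs; s_{ij} scores word j as the syntactic head of word i.
\[
  h_i^{(dep)} = \mathrm{MLP}^{(dep)}(r_i), \qquad
  h_j^{(head)} = \mathrm{MLP}^{(head)}(r_j)
\]
\[
  s_{ij}^{(arc)} = {h_j^{(head)}}^{\top} U^{(1)}\, h_i^{(dep)}
                 + {u^{(2)}}^{\top} h_j^{(head)}
\]
```

The bilinear term models the compatibility of a specific head-dependent pair, while the linear term captures the prior plausibility of word j acting as a head at all.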
Parser Ref. - Abbreviation                   Method                        Parsing
(Chen and Manning, 2014) - CM14              Tb: a-s                       Greedy
(Ballesteros et al., 2015) - BA15            Tb: a-s                       Greedy
(Kiperwasser and Goldberg, 2016) - KG16:T    Tb: a-h                       Greedy
(Kiperwasser and Goldberg, 2016) - KG16:G    Gb: a-f                       Eisner
(Andor et al., 2016) - AN16                  Tb: a-s                       Beam-S
(Cheng et al., 2016) - CH16                  Gb: a-f                       cle
(Dozat and Manning, 2017) - DM17             Gb: a-f                       cle
(Shi et al., 2017a,b) - SH17                 Tb: a-h/a-eager or Gb: a-f    Greedy / Eisner
(Nguyen et al., 2017) - NG17                 Gb: a-f                       Eisner

Table 1: All the neural parsers considered in this study with their fundamental features, as well as the abbreviations used throughout the paper. In this table "Tb/Gb" means "Transition/Graph-based", "Beam-S" means "Beam-search" and "a-s/h/f" means "arc-standard/hybrid/factored".

We trained, validated and tested the nine considered parsers, as well as all the proposed extensions, by considering three different setups:

• setup0: only the UD Italian 2.1 dataset;

• setup1: only the UD Italian PoSTWITA 2.2 dataset;

• setup2: the UD Italian 2.1 dataset joined with the UD Italian PoSTWITA 2.2 dataset (train and validation sets), keeping the test set of PoSTWITA 2.2.

After the influential paper by Reimers and Gurevych (2017) it is clear to the community that reporting a single score for each DNN training session could be heavily affected by the system initialisation point; we should instead report the mean and standard deviation of various runs with the same settings, in order to get a more accurate picture of the real systems' performances and make more reliable comparisons between them (see the snippet below).
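As a minimal illustration of this reporting protocol (the scores below are hypothetical, not taken from our experiments):

```python
import statistics

# LAS of five training runs with identical settings, differing only in
# the random initialisation (hypothetical values for illustration).
las_runs = [91.37, 91.12, 91.58, 91.44, 91.29]

mean = statistics.mean(las_runs)
stdev = statistics.stdev(las_runs)  # sample standard deviation over the runs

# Reported in the "mean/standard deviation" format used in Table 2.
print(f"LAS: {mean:.2f}/{stdev:.2f}")
```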
Table 2 shows the parsers' performances on the test sets for the three setups described above, executing the training/validation/test cycle 5 times. In every setup the DM17 parser exhibits the best performances, which are notably high for general Italian. As one could expect, the performances in setup1 were much lower than those in setup0, due to the intrinsic difficulty of parsing tweets and to the scarcity of annotated tweets for training. Joining the two datasets in setup2 allowed a relevant gain in parsing tweets, even though we added out-of-domain data. For these reasons, in all the following experiments we abandoned setup1, since it seemed more relevant to use the joined data (setup2) and compare the results to setup0.

setup0           Valid. Ita                   Test Ita
          UAS          LAS           UAS          LAS
CM14      88.20/0.18   85.46/0.14    89.33/0.17   86.85/0.22
BA15      91.15/0.11   88.55/0.23    91.57/0.38   89.15/0.33
KG16:T    91.17/0.29   88.42/0.24    91.21/0.33   88.72/0.24
KG16:G    91.85/0.27   89.23/0.31    92.04/0.18   89.65/0.10
AN16      85.52/0.34   77.67/0.30    87.70/0.31   79.48/0.24
CH16      92.42/0.00   89.60/0.00    92.82/0.00   90.26/0.00
DM17      93.37/0.27   91.37/0.24    93.72/0.14   91.84/0.18
SH17      89.67/0.24   85.05/0.24    89.89/0.29   84.55/0.30
NG17      90.37/0.12   87.19/0.21    90.67/0.15   87.58/0.11

setup1           Valid. PoSTW                 Test PoSTW
          UAS          LAS           UAS          LAS
CM14      81.03/0.17   75.24/0.30    81.50/0.28   76.07/0.17
BA15      83.44/0.20   77.70/0.25    84.06/0.38   78.64/0.44
KG16:T    77.38/0.14   68.81/0.25    77.41/0.43   69.13/0.43
KG16:G    78.81/0.23   70.14/0.33    78.78/0.44   70.52/0.51
AN16      77.74/0.25   66.63/0.16    77.78/0.33   67.21/0.30
CH16      84.78/0.00   78.51/0.00    86.12/0.00   79.89/0.00
DM17      85.01/0.16   78.80/0.09    86.26/0.16   80.40/0.19
SH17      80.52/0.18   73.71/0.14    81.11/0.29   74.53/0.26
NG17      82.02/0.11   75.20/0.24    82.74/0.39   76.22/0.41

setup2           Valid. Ita+PoSTW             Test PoSTW
          UAS          LAS           UAS          LAS
CM14      85.52/0.13   81.51/0.05    82.62/0.24   77.45/0.23
BA15      87.85/0.13   83.80/0.12    85.15/0.29   80.12/0.27
KG16:T    83.89/0.23   77.77/0.26    80.47/0.36   72.92/0.46
KG16:G    84.70/0.14   78.41/0.14    81.41/0.37   73.49/0.19
AN16      82.95/0.33   73.46/0.37    79.81/0.27   69.19/0.19
CH16      89.16/0.00   84.56/0.00    86.85/0.00   80.93/0.00
DM17      89.72/0.10   85.85/0.13    87.22/0.24   81.65/0.21
SH17      85.85/0.36   80.00/0.39    83.12/0.50   76.38/0.38
NG17      86.81/0.04   82.13/0.09    84.09/0.07   78.02/0.11

Table 2: Mean/standard deviation of UAS/LAS for each parser and for the different setups, repeating the experiments 5 times. All the results are statistically significant (p < 0.05) and the best values are shown in boldface.

3 An Ensemble of Neural Parsers

The DEPENDABLE tool of Choi et al. (2015) reports ensemble upper-bound performances, assuming that, given the parsers' outputs, the best tree can be identified by an oracle "MACRO" (MA), or that the best arc can be identified by another oracle "MICRO" (mi). Table 3 shows that, by applying these oracles, we have plenty of room for improving the performances by building some kind of ensemble system able to cleverly choose the correct information from the different parsers' outputs and combine it into a better final solution. This observation motivates our proposal.

             Validation              Test
          UAS       LAS         UAS       LAS
setup0
mi        98.30%    97.82%      98.08%    97.72%
MA        96.62%    95.10%      96.31%    94.82%
setup2
mi        97.08%    96.02%      96.32%    94.73%
MA        94.62%    91.29%      93.27%    88.50%

Table 3: Results obtained by building an ensemble system based on the oracles mi and MA, considering all parsers.
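To make the arc-level oracle concrete, its upper bound can be computed roughly as follows (our illustrative sketch, not the DEPENDABLE implementation; the input format, lists of predicted head indices per token, is an assumption):

```python
def micro_oracle_uas(parser_outputs, gold_heads):
    """Upper-bound UAS: an arc counts as correct if ANY parser got it right.

    parser_outputs: one prediction per parser; each prediction is a list of
                    head indices, one per token (hypothetical format).
    gold_heads:     gold head index for each token.
    """
    correct = sum(
        1
        for tok, gold in enumerate(gold_heads)
        if any(pred[tok] == gold for pred in parser_outputs)
    )
    return 100.0 * correct / len(gold_heads)

# Toy example with three parsers over a five-token sentence:
outputs = [[0, 1, 1, 2, 3], [0, 3, 1, 2, 2], [2, 1, 4, 2, 3]]
print(micro_oracle_uas(outputs, [0, 1, 1, 2, 2]))  # 100.0: each gold arc is found by some parser
```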
To combine the parser outputs we used the following ensemble schemas:

• Voting: each parser contributes by assigning a vote to every dependency edge, as described in Zeman and Žabokrtský (2005). With the majority approach the resulting dependency tree could be ill-formed; in this case, using the switching approach, the tree is replaced with the output of the first parser (see the sketch after this list).

• Reparsing: as described in Sagae and Lavie (2006) and Hall et al. (2007), an MST algorithm is used to reparse a graph in which each word of the sentence is a node. The MST algorithms used are Chu-Liu/Edmonds (cle) and Eisner, as reported in McDonald et al. (2005). Three weighting strategies are used for Chu-Liu/Edmonds: equally weighted (w2); weighted according to the total labeled accuracy on the validation set (w3); weighted according to the labeled accuracy per coarse-grained PoS tag on the validation set (w4).

• Distilling: in Kuncoro et al. (2016) the authors train a distillation parser using a loss objective with a cost that incorporates ensemble uncertainty estimates for each possible attachment.
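A minimal, unlabeled sketch of the voting schema (our own illustration under assumed data structures, not the implementation used in the experiments):

```python
from collections import Counter

def vote_heads(predictions):
    """Majority vote over edges. predictions holds one head sequence per
    parser (assumed format: heads[i] is the head of token i+1, with 0 as
    the artificial root); ties resolve to the head counted first, i.e. to
    the first parser listed, via Counter's insertion-order stability."""
    n = len(predictions[0])
    return [Counter(pred[tok] for pred in predictions).most_common(1)[0][0]
            for tok in range(n)]

def is_tree(heads):
    """True when every token reaches the root (head 0) without cycles,
    i.e. the voted head sequence forms a well-formed dependency tree."""
    for tok in range(1, len(heads) + 1):
        seen, node = set(), tok
        while node != 0:
            if node in seen or not 1 <= node <= len(heads):
                return False  # cycle or dangling head: ill-formed output
            seen.add(node)
            node = heads[node - 1]
    return True

def combine(predictions):
    """Majority voting with the switching fallback: if voting produces an
    ill-formed tree, return the tree of the first (most trusted) parser."""
    voted = vote_heads(predictions)
    return voted if is_tree(voted) else predictions[0]
```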
4 Results

Tables 4, 7 and 9 show the performances of the ensembles built on the best validation-set results obtained in the 5 training/test cycles, considering both setup0 and setup2. Table 6 reports the number of malformed trees for the majority strategy. Tables 5 and 8 report the number of cases in which the ensemble output differs from the baseline, for both labeled (L) and unlabeled (U) outputs. On average, the percentage of differing unlabeled outputs varies from 2% to 15% with respect to the baseline. For the best result (DM17+ALL) the difference on setup0 and setup2 is about 4%.

setup0
                           Validation          Test
Voters/Strategy            UAS      LAS      UAS      LAS
DM17+CH16+BA15/maj.      94.20%   92.27%   93.77%   92.13%
DM17+CH16+BA15/swi.      94.11%   92.16%   93.79%   92.14%
AN16+CM14+SH17/maj.      90.43%   87.96%   91.03%   88.47%
AN16+CM14+SH17/swi.      89.44%   86.77%   90.17%   87.43%
DM17+CM14+SH17/maj.      93.84%   92.03%   93.82%   92.27%
DM17+CM14+SH17/swi.      93.76%   91.94%   93.82%   92.25%
AN16+ALL/maj.            94.37%   92.65%   93.83%   92.27%
AN16+ALL/swi.            93.99%   92.15%   93.43%   91.73%
DM17+ALL/maj.            94.42%   92.67%   93.94%   92.41%
DM17+ALL/swi.            94.38%   92.60%   93.91%   92.37%
DM17 (baseline)          93.74%   91.66%   93.75%   92.03%

setup2
                           Validation          Test
Voters/Strategy            UAS      LAS      UAS      LAS
DM17+CH16+BA15/maj.      90.57%   87.16%   88.21%   83.64%
DM17+CH16+BA15/swi.      90.51%   87.10%   88.13%   83.51%
AN16+CM14+SH17/maj.      86.90%   83.60%   84.09%   79.78%
AN16+CM14+SH17/swi.      86.01%   82.50%   82.58%   77.94%
DM17+CM14+SH17/maj.      90.35%   87.21%   88.07%   83.64%
DM17+CM14+SH17/swi.      90.27%   87.11%   87.99%   83.52%
AN16+ALL/maj.            90.30%   87.26%   88.36%   84.13%
AN16+ALL/swi.            89.70%   86.45%   87.46%   83.06%
DM17+ALL/maj.            90.64%   87.60%   88.51%   84.42%
DM17+ALL/swi.            90.65%   87.62%   88.50%   84.20%
DM17 (baseline)          89.82%   85.96%   87.59%   81.95%

Table 4: Results of ensembles using the switching and majority approaches on the best models in setup0 and setup2. The baseline is defined by the best results of Dozat and Manning (2017).

setup0
                          Validation      Test
                           /11,908       /10,417
Voters/Strategy             U      L      U      L
DM17+CH16+BA15/maj.       208     61    188     46
DM17+CH16+BA15/swi.       192     52    175     39
AN16+CM14+SH17/maj.     1,006    424    783    336
AN16+CM14+SH17/swi.     1,130    489    870    371
DM17+CM14+SH17/maj.       170     37    139     15
DM17+CM14+SH17/swi.       157     33    129     13
AN16+ALL/maj.             382    126    328    105
AN16+ALL/swi.             460    164    386    133
DM17+ALL/maj.             356    117    282     81
DM17+ALL/swi.             312     97    255     72

setup2
                          Validation      Test
                           /24,243       /12,668
Voters/Strategy             U      L      U      L
DM17+CH16+BA15/maj.       597    219    470    213
DM17+CH16+BA15/swi.       521    185    394    172
AN16+CM14+SH17/maj.     2,757  1,329  1,805    941
AN16+CM14+SH17/swi.     2,976  1,429  1,986  1,033
DM17+CM14+SH17/maj.       490    140    337     93
DM17+CM14+SH17/swi.       453    121    300     73
AN16+ALL/maj.           1,377    624    897    440
AN16+ALL/swi.           1,610    741  1,063    534
DM17+ALL/maj.           1,156    502    784    378
DM17+ALL/swi.             920    374    614    280

Table 5: Number of cases in which the output of the ensemble systems, using switching and majority, differs from the baseline Dozat and Manning (2017).

                        setup0           setup2
Voters                Valid.   Test    Valid.   Test
                       /564    /482    /1,235   /674
DM17+CH16+BA15            9       7        31     31
AN16+CM14+SH17           45      25        88     77
DM17+CM14+SH17            6       6        19     23
AN16+ALL                 18      17        73     63
DM17+ALL                 17      11        75     57

Table 6: Number of malformed trees obtained by using the majority strategy in both setups.

The results of the voting approach reported in Table 4 show that the majority strategy is slightly better than the switching strategy, although it must be taken into account that the former may produce ill-formed dependency trees. The percentage of ill-formed trees on the validation/test sets varies from a minimum of 2% to a maximum of 8%. For this reason the majority strategy should be used only when it is followed by a manual correction phase. The switching strategy performs well if the first voter is one of the best parsers: in fact, the combinations AN16+ALL and AN16+CM14+SH17 perform worse than their counterparts using the best parser (DM17) as the first voter. Overall, the highest performance is achieved using all parsers together with DM17 as the first voter. For setup0 the increases are +0.19% in UAS and +0.38% in LAS, while in setup2 they are +0.92% in UAS and +2.47% in LAS, with respect to the best single parser (again DM17).

The results of the reparsing approach reported in Table 7 show that the Chu-Liu/Edmonds algorithm is slightly better than the Eisner algorithm; a sketch of this schema is given below. In this case, the choice of which strategy to use must take into account whether or not we want to allow non-projectivity. The percentage of non-projective dependency trees produced by Chu-Liu/Edmonds on the validation/test sets varies from a minimum of 7% to a maximum of 12%, compared with an average of 4% for the Italian corpora. Overall, the highest performances are achieved using the Chu-Liu/Edmonds algorithm: for setup0 the increases are +0.25% in UAS and +0.45% in LAS, while in setup2 they are +0.77% in UAS and +2.30% in LAS, with respect to the best single parser (DM17).
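For concreteness, the weighted reparsing schema can be sketched as follows (our illustration, not the experimental code; it assumes head sequences with 0 as the artificial root and relies on the Chu-Liu/Edmonds implementation available in networkx):

```python
import networkx as nx

def reparse(predictions, weights):
    """Weighted reparsing: each parser adds its weight to every arc it
    proposes (head -> dependent, heads 1-based with 0 as the root), then
    the maximum spanning arborescence is extracted (Chu-Liu/Edmonds)."""
    G = nx.DiGraph()
    for heads, w in zip(predictions, weights):
        for dep, head in enumerate(heads, start=1):
            # Accumulate the parser weights on each proposed arc.
            prev = G.get_edge_data(head, dep, default={"weight": 0.0})["weight"]
            G.add_edge(head, dep, weight=prev + w)
    msa = nx.maximum_spanning_arborescence(G)  # rooted at node 0
    return [next(iter(msa.predecessors(dep)))
            for dep in range(1, len(predictions[0]) + 1)]

# w2 (equal weights) over three hypothetical parser outputs:
preds = [[0, 1, 2], [0, 1, 1], [2, 0, 1]]
print(reparse(preds, [1.0, 1.0, 1.0]))  # -> [0, 1, 1]
```

The w3 and w4 variants differ only in the weight vector: per-parser validation LAS for w3, and a per-token weight indexed by the dependent's coarse-grained PoS tag for w4.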
setup0
                             Validation          Test
Voters/Strategy              UAS      LAS      UAS      LAS
DM17+CH16+BA15/cle-w2      93.82%   91.85%   93.54%   91.83%
DM17+CH16+BA15/cle-w3      93.89%   91.82%   93.78%   92.06%
DM17+CH16+BA15/cle-w4      94.20%   92.28%   93.72%   92.04%
DM17+CH16+BA15/eisner      94.05%   92.05%   93.46%   91.78%
ALL/cle-w2                 94.31%   92.53%   93.85%   92.23%
ALL/cle-w3                 94.16%   92.41%   94.00%   92.48%
ALL/cle-w4                 94.29%   92.58%   93.95%   92.38%
ALL/eisner                 94.31%   92.53%   93.95%   92.35%
DM17 (baseline)            93.74%   91.66%   93.75%   92.03%

setup2
                             Validation          Test
Voters/Strategy              UAS      LAS      UAS      LAS
DM17+CH16+BA15/cle-w2      90.33%   86.95%   87.69%   83.31%
DM17+CH16+BA15/cle-w3      89.82%   85.96%   87.59%   81.95%
DM17+CH16+BA15/cle-w4      90.41%   86.99%   87.94%   83.32%
DM17+CH16+BA15/eisner      90.50%   87.05%   88.04%   83.51%
ALL/cle-w2                 90.52%   87.53%   88.36%   84.25%
ALL/cle-w3                 89.90%   86.75%   87.79%   83.54%
ALL/cle-w4                 90.42%   87.46%   88.19%   84.11%
ALL/eisner                 90.45%   87.41%   88.31%   84.08%
DM17 (baseline)            89.82%   85.96%   87.59%   81.95%

Table 7: Results of ensembles using the reparsing approaches on the best models in setup0 and setup2. The baseline is again defined by the best results of DM17.

setup0
                            Validation      Test
                             /11,908       /10,417
Voters/Strategy               U      L      U      L
DM17+CH16+BA15/cle-w2       360    129    307     90
DM17+CH16+BA15/cle-w3        96      0     89      1
DM17+CH16+BA15/cle-w4       267     76    247     52
DM17+CH16+BA15/eisner       375    130    327    103
ALL/cle-w2                  400    131    333    103
ALL/cle-w3                  351    108    299     79
ALL/cle-w4                  383    126    307     87
ALL/eisner                  411    133    333    106

setup2
                            Validation      Test
                             /24,243       /12,668
Voters/Strategy               U      L      U      L
DM17+CH16+BA15/cle-w2     1,056    496    800    424
DM17+CH16+BA15/cle-w3         0      0      0      0
DM17+CH16+BA15/cle-w4       603    264    491    236
DM17+CH16+BA15/eisner     1,047    443    789    376
ALL/cle-w2                1,347    599    882    417
ALL/cle-w3                1,261    537    804    363
ALL/cle-w4                1,274    576    822    389
ALL/eisner                1,367    607    916    436

Table 8: Number of cases in which the output of the ensemble systems, using the reparsing approaches, differs from the baseline Dozat and Manning (2017).

The results of the distilling strategy, reported in Table 9, show worse outcomes than the previous approaches, scoring below the baseline.

Setup       UAS                   LAS
setup0      92.50% (–1.25%)       89.93% (–2.10%)
setup2      86.73% (–0.86%)       81.39% (–0.56%)

Table 9: Results of the distilling approach on the best models in setup0 and setup2. In brackets we report the differences between the distilled models and the best results of DM17, as baseline.
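For reference, the "ensemble uncertainty" cost used by the distilling schema can be read, roughly, as follows (our paraphrase of Kuncoro et al. (2016) with assumed notation, not their exact objective):

```latex
% Sketch: q(h, d) is the fraction of the M ensemble parsers assigning
% head h to dependent d; attachments the ensemble rarely proposes
% receive a high cost in the distillation loss.
\[
  \mathrm{cost}(h, d) = 1 - q(h, d), \qquad
  q(h, d) = \frac{1}{M} \sum_{m=1}^{M} \mathbb{1}\big[ y_m(d) = h \big]
\]
```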
Thanks to the number of parser models adopted in the experiments, it has been possible to verify that the performances of the ensemble models increase as the number of parsers grows.

The improvement in LAS is, in most cases, at least twice that in UAS. This could mean that the ensemble models capture the type of dependency relation with better precision than the head-dependent attachment itself.

All the proposed ensemble strategies, except for distilling, perform more or less in the same way; the choice of which strategy to use therefore depends, in part, on the properties that we want the combined dependency tree to have.

5 Discussion and Conclusions

We have studied the performances of some neural dependency parsers on the generic and social media domains. Using the predictions of each single parser, we combined the best outcomes in various ways to improve the overall performance. The ensemble models are more effective on corpora built using in-domain data (social media), giving an improvement of ∼1% in UAS and ∼2.5% in LAS.

Our work is inspired by the work of Mazzei (2015). Differently from his work, we use a larger set of state-of-the-art parsers, all based on neural networks, in order to gain more diversity among the models used in the ensembles; furthermore, we experimented with the distilling strategy and the Eisner reparsing algorithm. Moreover, we built ensembles on larger datasets using both generic and social media texts.

Acknowledgements

We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research.

References

Anita Alicante, Cristina Bosco, Anna Corazza, and Alberto Lavelli. 2015. Evaluating Italian parsing across syntactic formalisms and annotation schemes. In Roberto Basili, Cristina Bosco, Rodolfo Delmonte, Alessandro Moschitti, and Maria Simi, editors, Harmonization and Development of Resources and Tools for Italian Natural Language Processing within the PARLI Project, Springer International Publishing, Cham, pages 135–159.

Daniel Andor, Chris Alberti, David Weiss, Aliaksei Severyn, Alessandro Presta, Kuzman Ganchev, Slav Petrov, and Michael Collins. 2016. Globally normalized transition-based neural networks. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). ACL, Berlin, Germany, pages 2442–2452.

Miguel Ballesteros, Chris Dyer, and Noah A. Smith. 2015. Improved transition-based parsing by modeling characters instead of words with LSTMs. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. ACL, Lisbon, Portugal, pages 349–359.

Cristina Bosco, Felice Dell'Orletta, Simonetta Montemagni, Manuela Sanguinetti, and Maria Simi. 2014. The EVALITA 2014 dependency parsing task. In Proceedings of the Fourth International Workshop EVALITA 2014. Pisa, Italy, pages 1–8.

Cristina Bosco and Alessandro Mazzei. 2011. The EVALITA 2011 parsing task. In Working Notes of EVALITA 2011, CELCT, Povo, Trento.

Danqi Chen and Christopher Manning. 2014. A fast and accurate dependency parser using neural networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). ACL, Doha, Qatar, pages 740–750.

Hao Cheng, Hao Fang, Xiaodong He, Jianfeng Gao, and Li Deng. 2016. Bi-directional attention with agreement for dependency parsing. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. ACL, Austin, Texas, pages 2204–2214.

Jinho D. Choi, Joel Tetreault, and Amanda Stent. 2015. It depends: Dependency parser comparison using a web-based evaluation tool. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). ACL, Beijing, China, pages 387–396.

Timothy Dozat and Christopher D. Manning. 2017. Deep biaffine attention for neural dependency parsing. In Proceedings of the 2017 International Conference on Learning Representations.

Chris Dyer, Miguel Ballesteros, Wang Ling, Austin Matthews, and Noah A. Smith. 2015. Transition-based dependency parsing with stack long short-term memory. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). ACL, Beijing, China, pages 334–343.

Johan Hall, Jens Nilsson, Joakim Nivre, Gülsen Eryigit, Beáta Megyesi, Mattias Nilsson, and Markus Saers. 2007. Single malt or blended? A study in multilingual parser optimization. In Proceedings of the CoNLL Shared Task Session of EMNLP-CoNLL 2007. ACL, Prague, Czech Republic, pages 933–939.

Eliyahu Kiperwasser and Yoav Goldberg. 2016. Simple and accurate dependency parsing using bidirectional LSTM feature representations. Transactions of the Association for Computational Linguistics 4:313–327.

Adhiguna Kuncoro, Miguel Ballesteros, Lingpeng Kong, Chris Dyer, and Noah A. Smith. 2016. Distilling an ensemble of greedy dependency parsers into one MST parser. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. ACL, Austin, Texas, pages 1744–1753.
Alberto Lavelli. 2013. An ensemble model for the EVALITA 2011 dependency parsing task. In Bernardo Magnini, Francesco Cutugno, Mauro Falcone, and Emanuele Pianta, editors, Evaluation of Natural Language and Speech Tools for Italian. Springer Berlin Heidelberg, Berlin, Heidelberg, pages 30–36.

Alberto Lavelli. 2014. Comparing state-of-the-art dependency parsers for the EVALITA 2014 dependency parsing task. In Proceedings of the Fourth International Workshop EVALITA 2014. Pisa, Italy, pages 15–20.

Alberto Lavelli. 2016. Comparing state-of-the-art dependency parsers on the Italian Stanford Dependency Treebank. In Proceedings of the Third Italian Conference on Computational Linguistics (CLiC-it 2016). Napoli, Italy, pages 173–178.

Alessandro Mazzei. 2015. Simple voting algorithms for Italian parsing. In Roberto Basili, Cristina Bosco, Rodolfo Delmonte, Alessandro Moschitti, and Maria Simi, editors, Harmonization and Development of Resources and Tools for Italian Natural Language Processing within the PARLI Project, Springer International Publishing, Cham, pages 161–171.

Ryan McDonald, Fernando Pereira, Kiril Ribarov, and Jan Hajic. 2005. Non-projective dependency parsing using spanning tree algorithms. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing. ACL, Vancouver, British Columbia, Canada, pages 523–530.

Dat Quoc Nguyen, Mark Dras, and Mark Johnson. 2017. A novel neural network model for joint POS tagging and graph-based dependency parsing. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies. ACL, Vancouver, Canada, pages 134–142.

Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajic, Christopher D. Manning, Ryan McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, Reut Tsarfaty, and Daniel Zeman. 2016. Universal Dependencies v1: A multilingual treebank collection. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016).

Nils Reimers and Iryna Gurevych. 2017. Reporting score distributions makes a difference: Performance study of LSTM-networks for sequence tagging. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. ACL, Copenhagen, Denmark, pages 338–348.

Kenji Sagae and Alon Lavie. 2006. Parser combination by reparsing. In Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers. ACL, Stroudsburg, PA, USA, NAACL-Short '06, pages 129–132.

Tianze Shi, Liang Huang, and Lillian Lee. 2017a. Fast(er) exact decoding and global training for transition-based dependency parsing via a minimal feature set. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. ACL, Copenhagen, Denmark, pages 12–23.

Tianze Shi, Felix G. Wu, Xilun Chen, and Yao Cheng. 2017b. Combining global models for parsing universal dependencies. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies. ACL, Vancouver, Canada, pages 31–39.

Daniel Zeman and Zdeněk Žabokrtský. 2005. Improving parsing accuracy by combining diverse dependency parsers. In Proceedings of the Ninth International Workshop on Parsing Technology. ACL, Vancouver, British Columbia, pages 171–178.