Old School vs. New School: Comparing Transition-Based Parsers with and without Neural Network Enhancement

Miryam de Lhoneux, Sara Stymne and Joakim Nivre
Department of Linguistics and Philology
Uppsala University
E-mail: {miryam.de_lhoneux,sara.stymne,joakim.nivre}@lingfil.uu.se

Abstract

In this paper, we attempt a comparison between "new school" transition-based parsers that use neural networks and their classical "old school" counterparts. We carry out experiments on treebanks from the Universal Dependencies project. To facilitate the comparison and analysis of results, we only work on a subset of those treebanks, but we select this subset carefully in the hope of obtaining results that are representative of the whole set of treebanks. We select two parsers intended to be representative of the two schools, MaltParser and UDPipe, and we look at the impact of training size on the two models. We hypothesize that neural network enhanced models have a steeper learning curve with increased training size. We observe, however, that, contrary to expectations, neural network enhanced models need only a small amount of training data to outperform the classical models, and that the learning curves of both models increase at a similar pace after that. We carry out an error analysis on the development sets parsed by the two systems and observe that overall MaltParser suffers more than UDPipe from longer dependencies. We observe that MaltParser is only marginally better than UDPipe on a restricted set of short dependencies.

1 Introduction

Treebanks have recently been released for a large number of languages with consistent annotation within the framework of the Universal Dependencies (UD) project [14]. In line with work on dependency parsing following the CoNLL shared task from 2006 [3], and in contrast with most work on constituency parsing, which has focused on one language and one domain (the English Penn Treebank), this project may help reshape the field of syntactic parsing by covering a wide variety of languages and domains. In parallel with the development of this project, syntactic parsing has seen a significant boost in accuracy in the last couple of years from methods that use neural networks to learn dense vector representations of words, POS tags and dependency relations [4, 18, 1] or stacks [5].

Motivated by the success of neural network models on the PTB and the Chinese treebank, Straka et al. [17] trained Parsito, a neural network model, on UD treebanks and obtained good results, improving over the 'classical' MaltParser [15]. There was, however, no systematic comparison between the classical and the neural network approach, and it may be that one approach is more suitable than the other for specific settings. Moreover, the results they reported for MaltParser were obtained using default settings, whereas results can be much higher with optimised settings. In this paper, we propose to compare the performance of these two types of parsers on UD treebanks. We first propose to select a sample of the UD treebanks to ease the comparison between parsing models in general.

2 Treebank Sampling for Comparative Parser Evaluation

Having a large and varied data set has the advantage that our parsing models will generalize better. A disadvantage is that it is expensive to train models for all the languages, especially with the new neural network models, which require a search over a large hyperparameter space in order to be optimised.
As UD grows, it may become more and more prohibitive to train models for all the languages whenever we want to evaluate one parser against another, or against a modified version of itself. We therefore suggest that it might be wise to examine parser behavior in a small-scale setting before training models for the large number of treebanks in UD. Parsing models can first be evaluated on a small sample of UD treebanks. Subsequently, depending on the observations on the small set, we can move to a medium-sized sample before finally testing on all the treebanks if evidence points towards a clear direction.

We have come up with a set of criteria to select the small sample, which we now turn to. The objective was to have a sample as representative of the whole treebank set as possible. To ensure typological variety, we divided the UD languages into coarse-grained and fine-grained language families, which led to a total of 15 different fine-grained families and 8 coarse-grained ones. We made it a requirement not to select two languages from the same fine-grained family and ensured some variety in coarse-grained families. We made sure to have at least one isolating, one morphologically rich and one inflecting language. We additionally ensured variability in treebank sizes and domains. Since parsing non-projective trees is notoriously harder than parsing projective trees, we also made sure to have at least one treebank with a large amount of non-projective trees. The quality of the treebanks was also considered in the selection; in particular, there are known issues about inconsistency in the annotation,¹ and we selected languages that had as few of those issues as possible. To ensure comparability, we finally made sure to select only treebanks with morphological features (with one exception, Kazakh). This resulted in a selection of 8 treebanks. The selection is given in Table 1, together with the main argument for including each treebank.

¹ http://universaldependencies.org/svalidation.html

Table 1: Treebank Sample

language               coarse          fine           main argument for inclusion
Czech                  indo-european   balto-slavic   the largest UD treebank
Chinese                sino-tibetan    sinitic        the only isolating language (without copyright restrictions)
Finnish                uralo-altaic    finno-ugric    morphologically rich; has many different domains
English                indo-european   germanic       largest treebank with a full manual check of the data
Ancient_Greek-PROIEL   indo-european   hellenic       large percentage of non-projective trees
Kazakh                 uralo-altaic    turkic         smallest treebank; full manual check of the annotations
Tamil                  dravidian       tamil          small treebank; language family considerations
Hebrew                 afro-asiatic    semitic        morphologically rich; language family considerations
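As an illustration, the core constraints behind this selection can be checked mechanically. The following is a minimal sketch in Python over the metadata of Table 1; the threshold for coarse-grained variety is an illustrative choice of ours, not a figure from the selection procedure.

# Check the sampling constraints of Section 2 against the metadata of Table 1.
SAMPLE = {
    # language: (coarse-grained family, fine-grained family)
    "Czech":                ("indo-european", "balto-slavic"),
    "Chinese":              ("sino-tibetan",  "sinitic"),
    "Finnish":              ("uralo-altaic",  "finno-ugric"),
    "English":              ("indo-european", "germanic"),
    "Ancient_Greek-PROIEL": ("indo-european", "hellenic"),
    "Kazakh":               ("uralo-altaic",  "turkic"),
    "Tamil":                ("dravidian",     "tamil"),
    "Hebrew":               ("afro-asiatic",  "semitic"),
}

fine_families = [fine for _, fine in SAMPLE.values()]
coarse_families = {coarse for coarse, _ in SAMPLE.values()}

# Requirement: no two languages from the same fine-grained family.
assert len(fine_families) == len(set(fine_families))

# Requirement: some variety in coarse-grained families (threshold illustrative).
assert len(coarse_families) >= 4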
3 Comparing Parsing Accuracy

As stated in the introduction, a systematic comparison of classical models for transition-based parsing with models enhanced by neural network training (henceforth NN parsers) is lacking. With the selection of treebanks just presented, we now compare these two model types. More specifically, we compare MaltParser with Parsito. As noted above, MaltParser results have previously been reported using default settings when compared with NN parsers, but results can be much better with optimised settings. For this reason, we optimised models with the help of MaltOptimizer [2]. In an attempt to keep the models somewhat similar across languages, we chose to use the same parsing algorithm for all the languages and hence only optimised options specific to the data set, such as the root label (MaltOptimizer's phase 1), and the feature model (MaltOptimizer's phase 3). We selected the arc-standard swap system with a lazy oracle for its ability to deal with non-projectivity, which was crucial at least for Ancient Greek. Additionally, it was one of the popular algorithms suggested by MaltOptimizer for our selection of languages, together with its projective version.

For Parsito, we used the pretrained models that the authors made available. They are trained on UD version 1.2, but we tested them on version 1.3, since that is the version used for the other parsers. We additionally optimised models for the languages for which no pretrained model was available, as well as for Tamil, because the results were too low on version 1.3 compared to version 1.2 (probably due to significant differences between the two versions). To optimise those models, we first made sure that we could reproduce the results of the pretrained models on one language (Hebrew); we then experimented with the transition system and oracle and used the random hyperparameter search provided by UDPipe to tune the hyperparameters.

We add the best reported results for SyntaxNet [1], since these are currently, on average, the best results reported for UD. Parsito and SyntaxNet are both transition-based parsers that use neural networks to learn vector representations of words, POS tags and dependency relations in a similar way to Chen and Manning [4]. Parsito adds a search-based oracle, and SyntaxNet adds beam search and global normalization. Parsito has been integrated into the recent UDPipe [16] parsing pipeline, which performs tokenisation, morphological analysis and POS tagging; beam search was added to Parsito in the version used in UDPipe. We will refer to that parser as the UDPipe parser in the remainder of this paper. UDPipe taggers and morphological analysers were trained for all treebanks, and both MaltParser and the UDPipe parser were tested on test sets where those models were used to predict POS tags (universal and language-specific) as well as morphological features. Note that the SyntaxNet results are not directly comparable, as they use their own POS tagger and morphological analyser.

Table 2: Parsing models comparison (LAS, excluding punctuation)

language               tokens    UDPipe   Malt   SyntaxNet
Ancient_Greek-PROIEL   206K      68.3     67.8   73.15
Chinese                123K      68.9     67.7   71.24
Czech                  1,503K    82.6     80.3   85.93
English                254K      81.3     79.9   80.38
Finnish                181K      75.7     74.2   79.60
Hebrew                 115K      77.2     76.2   78.71
Kazakh                 4K        41.2     44.4   43.95
Tamil                  8K        58.1     58.8   55.35

Labeled Attachment Scores (LAS) are given in Table 2. As the table shows, UDPipe largely outperforms MaltParser. However, MaltParser performs better than UDPipe on very small data sets. SyntaxNet is even better than UDPipe in most cases but is also outperformed by MaltParser on very small treebanks.

4 Impact of Training Size on Neural Network Parsing

Looking at these results, we can hypothesize that the superior performance of the NN parsers depends on having reasonably large training sets. More specifically, we hypothesize that neural network parsers improve more steeply with increased training size than MaltParser does. We tested this hypothesis with a learning curve experiment: we trained several parsers for our selection of languages, varying the size of the training data, using both MaltParser and UDPipe. In order to prevent sequential effects that may come from the structure of the treebanks from influencing the results, we randomly shuffled the sentences of each training set before splitting it into samples of different sizes. We first split the training data sets into one sample of 1K tokens and further samples in steps of 50K tokens. Additionally, to zoom in on what happens with very small data sizes, we also ran the experiment with splits of 1K tokens, from 1K to 15K.
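The following is a minimal sketch of this data preparation, assuming CoNLL-U training files; the file name, the random seed and the exact token budgets are illustrative.

import random

def read_sentences(path):
    """Read a CoNLL-U file as a list of sentences (lists of lines)."""
    with open(path, encoding="utf-8") as f:
        return [block.split("\n") for block in f.read().strip().split("\n\n")]

def n_tokens(sentence):
    """Count token lines, skipping comments and multiword token ranges."""
    return sum(1 for line in sentence
               if line and not line.startswith("#")
               and "-" not in line.split("\t")[0])

def write_sample(sentences, budget, path):
    """Write sentences until roughly `budget` tokens have been written."""
    written, kept = 0, []
    for sentence in sentences:
        if written >= budget:
            break
        kept.append("\n".join(sentence))
        written += n_tokens(sentence)
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n\n".join(kept) + "\n\n")

random.seed(1)  # fixed seed so that the splits are reproducible
sentences = read_sentences("cs-ud-train.conllu")
random.shuffle(sentences)

total = sum(n_tokens(s) for s in sentences)
budgets = [1_000] + list(range(50_000, total + 1, 50_000))  # main curve
# budgets = list(range(1_000, 16_000, 1_000))               # zoomed-in curve
for budget in budgets:
    write_sample(sentences, budget, f"cs-ud-train-{budget}.conllu")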
Because it would have been unreasonable to optimise models for each split size, we used unoptimised versions of both MaltParser and UDPipe. For MaltParser, we again used the arc-standard swap algorithm with a lazy oracle, together with an extended feature model that uses morphological features, in the same way as was done by Nivre [13]. The use of morphological features is important in order to have a model that is comparable to UDPipe, but also because morphological features are crucial for parsing morphologically rich languages like Finnish and Hebrew. For UDPipe, we also used the swap transition system with a lazy oracle, so that the two systems are comparable. We used UDPipe's default hyperparameters, which are reported to be the hyperparameters that worked best across their experiments on UD treebanks.²

² http://ufal.mff.cuni.cz/udpipe/users-manual

Figure 1: Learning curve experiment

Figure 2: Learning curve experiment: zoom on tiny training sizes

As can be seen in Figures 1 and 2, the hypothesis that neural network parsers improve more steeply than MaltParser with increased training size holds only to some extent. The learning curve for UDPipe is very steep for the first few thousand tokens but flattens out quite quickly after that, and then continues improving at a similar pace to MaltParser. It is also interesting to note that MaltParser does not even outperform UDPipe on all treebanks with a training size of 1K.

5 Error Analysis

5.1 Error Analysis Approaches

Since training size does not seem to be the only factor explaining the variation in results in the comparison of the parsers, we want to gain more insight into the strengths and weaknesses of each parser. There are many ways of comparing different parsers which, as described in Kirilin and Versley [7], can be placed on a coarse-grained to fine-grained scale, where the coarsest level consists in simply comparing attachment scores and the finest level consists in manually inspecting the output parse trees of both systems. There are many other possibilities between these two levels. McDonald and Nivre [11] look at the impact of different properties of trees on the accuracy of a transition-based and a graph-based parser. Goldberg and Elhadad [6] follow up on this and look in more detail at the over- and underproductions of specific constructions by these two systems, showing that it is possible to train a classifier that predicts which system was used to parse some output data. Kummerfeld et al. [8] also look at fine-grained phenomena such as PP and NP attachment and compare the behaviour of many constituency parsers on those phenomena. Kirilin and Versley [7] likewise look at fine-grained error patterns made by different parsers on UD treebanks.

There has not been much work on characterizing the errors of neural network parsers, at least the feedforward ones. Nguyen et al.
[12] compared the performance of two graph-based and two transition-based parsers, each pair of which contains one neural network parser, on the Vietnamese treebank. The parsers they used, however, rely on Recurrent Neural Network (RNN) models; such a study has not, as far as we are aware, been carried out for feedforward neural network parsers. Similarly, recent work has started investigating what neural network parsers learn [10, 9], but these studies again use RNNs. For this reason, the approach of McDonald and Nivre [11] seems particularly suited here to get an overview of the strengths and weaknesses of feedforward neural network parsers compared to their classical counterpart. It would be interesting to follow up on this study by looking at more fine-grained phenomena.

5.2 MaltParser vs UDPipe

As just mentioned, McDonald and Nivre [11] conducted an extensive error analysis of two parsers coming from two different parsing frameworks: MaltParser (transition-based) and MSTParser (graph-based). They analyse the effect on accuracy of different properties of dependency trees, which they divide into two types: graph factors and linguistic factors. For the first type, they look at dependency length, sentence length and a few other properties; for the second type, they break accuracy down by POS tag and by dependency relation. This allows them to observe a tradeoff between the rich representation of MaltParser, which allows it to do well on frequent dependencies, and the error propagation from which MaltParser suffers more than MSTParser. They argue that these results can be explained by the properties of the two parsers: rich representations combined with local greedy inference on the one hand, and less rich representations combined with global exact inference on the other.

We attempted to carry out a similar study to compare the two parsers investigated in this paper. McDonald and Nivre [11] worked on the CoNLL shared task data [3], that is, data from 13 different languages. They concatenated the test sets parsed by the two systems they compared and conducted their analysis on that concatenated set. They were able to do that because the test sets all had a very similar size. In this study, we are working on the development sets. We could have concatenated those development sets too, but their sizes vary by orders of magnitude (∼500 tokens to ∼150K tokens), so we unfortunately cannot do this. We could theoretically create a balanced test set using only as many tokens as there are in the smallest development set. The two smallest development sets are, however, very small (∼500 and ∼1,200 tokens), which means that doing this would make the data set very small and sparse. As a compromise between a perfectly balanced small data set and a very unbalanced big data set, we concatenated a portion of each development set of the size of the smallest development set excluding Tamil and Kazakh (which is the Finnish one, with ∼9K tokens). We added the full development sets of Tamil and Kazakh, keeping in mind that they have only a small impact on the results.
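The following is a minimal sketch of this balanced concatenation, assuming CoNLL-U development files; the language codes and file names are illustrative.

def read_sentences(path):
    """Read a CoNLL-U file as a list of sentences (lists of lines)."""
    with open(path, encoding="utf-8") as f:
        return [block.split("\n") for block in f.read().strip().split("\n\n")]

def n_tokens(sentence):
    """Count token lines, skipping comments and multiword token ranges."""
    return sum(1 for line in sentence
               if line and not line.startswith("#")
               and "-" not in line.split("\t")[0])

DEV = {
    "cs": "cs-ud-dev.conllu", "zh": "zh-ud-dev.conllu",
    "fi": "fi-ud-dev.conllu", "en": "en-ud-dev.conllu",
    "grc": "grc_proiel-ud-dev.conllu", "he": "he-ud-dev.conllu",
    "kk": "kk-ud-dev.conllu", "ta": "ta-ud-dev.conllu",
}
FULL = {"kk", "ta"}  # very small development sets: include them whole

# The per-treebank budget is the size of the smallest development set
# outside Tamil and Kazakh (the Finnish one, ~9K tokens).
sizes = {code: sum(n_tokens(s) for s in read_sentences(path))
         for code, path in DEV.items()}
budget = min(size for code, size in sizes.items() if code not in FULL)

parts = []
for code, path in DEV.items():
    total = 0
    for sentence in read_sentences(path):
        if code not in FULL and total >= budget:
            break
        parts.append("\n".join(sentence))
        total += n_tokens(sentence)

with open("balanced-dev.conllu", "w", encoding="utf-8") as f:
    f.write("\n\n".join(parts) + "\n\n")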
Graph Factors

Our overall results are not as clear-cut as those of McDonald and Nivre [11], but we observe somewhat similar tendencies: the advantages of MSTParser over MaltParser seem to carry over to UDPipe to some extent. For example, as can be seen in Figure 3, MaltParser seems to suffer more from dependency length than UDPipe does. Note that we use the F1 score, the harmonic mean of precision and recall, here instead of LAS, because we cannot assume a one-to-one correspondence between the predicted and gold dependencies.

Figure 3: Accuracy relative to dependency length

The picture is not so clear for sentence length, as can be seen from Figure 4: although accuracy decreases more for MaltParser than for UDPipe between sentences of 1-10 and 20-30 tokens, the two parsers perform similarly on sentences of 40-50 tokens, and UDPipe is then again better on sentences longer than 50 tokens. Note, however, that the maximum sentence lengths for Kazakh and Tamil are 27 and 49 tokens respectively, so the buckets up to 50 tokens include sentences from the two treebanks on which MaltParser outperforms UDPipe, while the longer buckets do not; this might partly explain the pattern.

Figure 4: Accuracy relative to sentence length

It is interesting to point out that Nguyen et al. [12] reported a similar tendency for RNN graph-based and transition-based parsers compared with their classical counterparts: the RNN parsers suffer less from sentence and dependency length than the classical ones.

Linguistic Factors

McDonald and Nivre [11] created a taxonomy of POS tags and dependency relations so as to group similar ones together and have consistent labels across treebanks. Luckily, in our case, we already have consistent dependency labels and POS tags. In Figure 5 we give accuracies (again in terms of F1, for the same reason as before) for the 15 most frequent dependency relations, and in Figure 6 we give accuracies for the POS tags. The picture is largely consistent with what we have observed so far: UDPipe is better than MaltParser on dependencies that may be distant, such as conj and advcl, while MaltParser is better, although not by much, on dependencies that are expected to be short, such as nummod.

Figure 5: Accuracy for different dependency relations

The accuracies for POS tags show a similar picture: UDPipe outperforms MaltParser on most POS tags, except those that are expected to be close to their head, such as NUM.

Figure 6: Accuracy for different POS tags

Overall, then, UDPipe outperforms MaltParser on most dependencies and tags, which shows that its representation is better in general. MaltParser outperforms UDPipe only by a small margin and only on a limited number of phenomena, all of which seem to involve short dependencies. It is important to note, however, that an important factor at play here is the beam search used by UDPipe, which has been shown to help with long dependencies [19]. It would be interesting to isolate the beam search factor from the neural network classifier by comparing UDPipe with and without it.
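For reference, the scores underlying Figures 3-6 can be computed with a short script. The following is a minimal sketch of the labelled F1 breakdown by dependency length (Figure 3), assuming gold and predicted CoNLL-U files with identical tokenization; the per-relation and per-POS breakdowns are analogous, keying on the dependency label or the dependent's POS tag instead of the arc length. Note that this sketch does not filter punctuation and keeps root attachments (head 0).

from collections import Counter

def arcs(path):
    """Collect (sentence, dependent, head, label) for every token line."""
    result, sent = set(), 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                sent += 1          # sentence boundary
            elif line.startswith("#"):
                continue           # comment line
            else:
                cols = line.split("\t")
                if "-" in cols[0] or "." in cols[0]:
                    continue       # multiword token range or empty node
                result.add((sent, int(cols[0]), int(cols[6]), cols[7]))
    return result

def f1_by_length(gold_path, pred_path):
    """Labelled precision/recall/F1, bucketed by dependency length."""
    gold, pred = arcs(gold_path), arcs(pred_path)
    length = lambda arc: abs(arc[1] - arc[2])
    correct = Counter(length(a) for a in gold & pred)
    in_gold = Counter(length(a) for a in gold)
    in_pred = Counter(length(a) for a in pred)
    scores = {}
    for l in sorted(set(in_gold) | set(in_pred)):
        p = correct[l] / in_pred[l] if in_pred[l] else 0.0
        r = correct[l] / in_gold[l] if in_gold[l] else 0.0
        scores[l] = 2 * p * r / (p + r) if p + r else 0.0
    return scores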
6 Conclusion and Future Work

In this paper, we presented a comparison of the performance of neural network parsers with their classical counterpart. We saw that, with optimised models, MaltParser outperforms UDPipe on very small treebanks. We investigated the effect of training size on both types of parsers and observed that, with default settings, UDPipe is better than MaltParser even on tiny data sizes. UDPipe has a steep learning curve for very small data sizes but then flattens out to an increase rate similar to that of MaltParser, which in most cases stays only a few percentage points below with increased training sizes. We carried out an error analysis and observed that MaltParser suffers more from increased dependency length than UDPipe does, and that MaltParser only does marginally better on a restricted set of short dependencies. Doing more fine-grained error analysis could lead to a more in-depth understanding of the strengths and weaknesses of the two models, and it would be interesting to investigate the effect of beam search on neural network models.

References

[1] Daniel Andor, Chris Alberti, David Weiss, Aliaksei Severyn, Alessandro Presta, Kuzman Ganchev, Slav Petrov, and Michael Collins. 2016. Globally normalized transition-based neural networks. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016), Berlin, Germany, Volume 1: Long Papers.

[2] Miguel Ballesteros and Joakim Nivre. 2016. MaltOptimizer: Fast and effective parser optimization. Natural Language Engineering 22(2):187–213.

[3] Sabine Buchholz and Erwin Marsi. 2006. CoNLL-X shared task on multilingual dependency parsing. In Proceedings of the Tenth Conference on Computational Natural Language Learning. Association for Computational Linguistics, pages 149–164.

[4] Danqi Chen and Christopher D. Manning. 2014. A fast and accurate dependency parser using neural networks. In Empirical Methods in Natural Language Processing (EMNLP).

[5] Chris Dyer, Miguel Ballesteros, Wang Ling, Austin Matthews, and Noah A. Smith. 2015. Transition-based dependency parsing with stack long short-term memory. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics, Beijing, China, pages 334–343.

[6] Yoav Goldberg and Michael Elhadad. 2010. Inspecting the structural biases of dependency parsing algorithms. In Proceedings of the Fourteenth Conference on Computational Natural Language Learning. Association for Computational Linguistics, pages 234–242.

[7] Angelika Kirilin and Yannick Versley. 2015. What is hard in Universal Dependency Parsing? In 6th Workshop on Statistical Parsing of Morphologically Rich Languages (SPMRL 2015), pages 31–38.

[8] Jonathan K. Kummerfeld, David Hall, James R. Curran, and Dan Klein. 2012. Parser showdown at the Wall Street corral: An empirical investigation of error types in parser output. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. Association for Computational Linguistics, pages 1048–1059.

[9] Adhiguna Kuncoro, Miguel Ballesteros, Lingpeng Kong, Chris Dyer, Graham Neubig, and Noah A. Smith. 2016. What do recurrent neural network grammars learn about syntax? arXiv preprint arXiv:1611.05774.

[10] Tal Linzen, Emmanuel Dupoux, and Yoav Goldberg. 2016. Assessing the ability of LSTMs to learn syntax-sensitive dependencies. arXiv preprint arXiv:1611.01368.

[11] Ryan McDonald and Joakim Nivre. 2007. Characterizing the errors of data-driven dependency parsing models. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 122–131.

[12] Dat Quoc Nguyen, Mark Dras, and Mark Johnson. 2016. An empirical study for Vietnamese dependency parsing. In Proceedings of the Australasian Language Technology Association Workshop 2016, Melbourne, Australia, pages 143–149.

[13] Joakim Nivre. 2016. Universal dependency evaluation. Unpublished paper.
[14] Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajič, Christopher D. Manning, Ryan McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, et al. 2016. Universal Dependencies v1: A multilingual treebank collection. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016).

[15] Joakim Nivre, Johan Hall, Jens Nilsson, Atanas Chanev, Gülşen Eryiğit, Sandra Kübler, Svetoslav Marinov, and Erwin Marsi. 2007. MaltParser: A language-independent system for data-driven dependency parsing. Natural Language Engineering 13(2):95–135.

[16] Milan Straka, Jan Hajič, and Jana Straková. 2016. UDPipe: Trainable pipeline for processing CoNLL-U files performing tokenization, morphological analysis, POS tagging and parsing. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016). European Language Resources Association (ELRA), Paris, France.

[17] Milan Straka, Jan Hajič, Jana Straková, and Jan Hajič jr. 2015. Parsing universal dependency treebanks using neural networks and search-based oracle. In Proceedings of the Fourteenth International Workshop on Treebanks and Linguistic Theories (TLT 14).

[18] David Weiss, Chris Alberti, Michael Collins, and Slav Petrov. 2015. Structured training for neural network transition-based parsing. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics, Beijing, China, pages 323–333.

[19] Yue Zhang and Joakim Nivre. 2012. Analyzing the effect of global learning and beam-search on transition-based dependency parsing. In COLING (Posters), pages 1391–1400.