On Interpretable Reranking-Based Dependency Parsing Systems

Florian Schottmann¹, Vincent Fortuin¹, Edoardo Ponti² and Ryan Cotterell¹
¹ Department of Computer Science, ETH Zurich, Zurich, Switzerland
² Mila AI Centre, Quebec, Canada

SwissText 2022: Swiss Text Analytics Conference, June 08–10, 2022, Lugano, Switzerland
fschottmann@ethz.ch (F. Schottmann); fortuin@inf.ethz.ch (V. Fortuin); edoardo-maria.ponti@mila.quebec (E. Ponti); ryan.cotterell@inf.ethz.ch (R. Cotterell)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

Reranking the k best hypothesis parse trees from an existing parser makes it possible to exploit more of the information the model has gathered during training than simply decoding the single most likely dependency tree. In this paper, we first investigate whether state-of-the-art dependency parsers can still benefit from reranking in low-resource languages. As part of this analysis, we deliver new insights concerning rerankability. Second, we propose a reranker reject option, which paves the way for designing interpretable reranking-based parsing systems in the future.

1. Introduction

Dependency parse trees model the structural relationships in sentences [1]. They are relevant for many downstream applications such as natural language understanding [2] and information extraction [3].

State-of-the-art dependency parsers predict the most likely dependency parse tree for a given sentence using large neural network architectures [4]. By acting greedily with respect to this prediction, these models disregard information that was gathered during training. In particular, the best dependency tree might not have been assigned the highest probability, for example due to unusual training sentences [5]. A technique that tries to remedy this problem is k-best reranking: the k best predictions are decoded from the base parser and then reranked by an additional machine learning model [6].

In this paper, we investigate whether interpretable rerankers can improve current state-of-the-art dependency parsers. We challenge previous conclusions about the situations in which base parsers can be reranked, and we propose a novel approach to combining the base parser and the reranker.

2. Background

Many state-of-the-art dependency parsing systems are graph-based parsers, which rely on an edge-factored model (e.g., [4, 7]). The underlying assumption of an edge-factored model is that the score of a dependency tree can be factored across the edges of the graph [1]. In this work, we define the set of edges of a dependency tree $T$ as $E_T$. An edge consists of a relation label $r \in R$ (where $R$ is the set of all possible relation labels) and two tokens $w_i$ and $w_j$, i.e., $(w_i, r, w_j) \in E_T$. If we use the conditional probability of a dependency tree (given a particular sentence $S$) as a scoring function, we can write the model as

$$p(T \mid S; \theta) = \frac{1}{Z} \prod_{(w_i, r, w_j) \in E_T} \exp\{s(w_i, r, w_j; \theta)\} \qquad (1)$$

$$Z = \sum_{T' \in \mathcal{T}_{\text{full}}(S)} \; \prod_{(w_i, r, w_j) \in E_{T'}} \exp\{s(w_i, r, w_j; \theta)\}$$

where $\mathcal{T}_{\text{full}}(S)$ is the set of all dependency trees for sentence $S$, $s$ is the edge scoring function, and $\theta$ is the parameter vector.

From the model definition, we can see that any edge-factored model learns $n \times n$ edge scores for a sentence $S$, where $n$ is the number of tokens in the sentence augmented by an artificial root token at the beginning of the sentence. Decoding the $k$ best dependency trees from such a score matrix after training is not possible naively, since the search space is exponential [8]. Recently, Zmigrod et al. [9] provided an algorithm for decoding the $k$ best dependency trees from an adjacency matrix induced by an edge-factored model, which we use in this work to construct the list of hypothesis trees.
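To make the factorization in Eq. (1) concrete, here is a minimal sketch of scoring a single candidate tree under an edge-factored model. The score matrix, the unlabeled-edge simplification, and all names are illustrative rather than the paper's implementation; the normalizer Z is omitted, since computing it requires summing over all trees (e.g., via the Matrix-Tree Theorem) rather than a simple loop.

```python
import numpy as np

def tree_log_score(scores: np.ndarray, edges: list[tuple[int, int]]) -> float:
    """Unnormalized log-score of a dependency tree under an edge-factored model.

    `scores[h, d]` stands in for s(w_h, w_d; theta) for head h and dependent d,
    with index 0 reserved for the artificial root token. Relation labels are
    omitted for brevity; a labeled model would index scores[h, r, d].
    """
    # The tree score factors over edges: log p(T|S) = sum of edge scores - log Z.
    return sum(scores[h, d] for h, d in edges)

# Toy example: a 3-token sentence (root + 3 tokens gives a 4x4 score matrix).
rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 4))
tree = [(0, 2), (2, 1), (2, 3)]  # root -> w2, w2 -> w1, w2 -> w3
print(tree_log_score(scores, tree))
```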
Often, building a reranking model on top of the base parser is not yet enough for state-of-the-art performance. A popular strategy for improving a reranking model is mixture reranking (MR), i.e., combining the scores of the base parser and the reranker with a trade-off parameter that is tuned on the development set [10, 11]:

$$s_f = \alpha \, s_{RR}(T, \theta_{RR}) + (1 - \alpha) \, s_B(T, \theta_B) \qquad (2)$$

where $\alpha \in [0, 1]$ is the mixing parameter, $s_{RR}$ is the score of the reranker, and $s_B$ is the score of the base parser given a tree $T$. The $\theta$ are the respective parameter vectors and $s_f$ is the final score of the tree $T$.
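A minimal sketch of mixture reranking over a k-best list follows; the function and variable names are illustrative, not taken from the paper's code.

```python
def mixture_rerank(candidates, reranker_score, base_score, alpha: float):
    """Return the candidate tree maximizing the mixture score of Eq. (2).

    `candidates` is the k-best list from the base parser; `reranker_score`
    and `base_score` map a tree to s_RR and s_B, respectively; `alpha` is
    the mixing parameter tuned on the development set.
    """
    def final_score(tree):
        return alpha * reranker_score(tree) + (1.0 - alpha) * base_score(tree)

    return max(candidates, key=final_score)
```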
3. K-best list

Recently, Do and Rehbein [5] emphasized that, apart from the oracle parsing accuracy [6], the quality of the k-best lists of a dataset can be indicated by the gold tree ratio and the standard deviation of the unlabeled attachment scores (UAS) within the k-best list. The gold tree ratio refers to the number of times a correct tree is in the k-best list relative to the number of sentences in the respective dataset. The oracle parsing accuracy is a similar metric introduced by Hall [6], referring to the maximum score achievable by always choosing the tree closest to the gold tree in the k-best list.

4. Reranker reject option

In this work, we reframe mixture reranking by using a reject option for the reranker. A reject option is a concept from decision theory; it refers to the idea of rejecting a classification decision if the associated probability is lower than a certain threshold $\tau$ [12]. In the context of reranking, the reject option works as follows:

$$T^* = \begin{cases} \operatorname{argmax}_{T' \in \mathcal{T}_k} \, p(T' \mid S) & \text{if } \max_{T' \in \mathcal{T}_k} p(T' \mid S) > \tau \\ T_1 & \text{else} \end{cases} \qquad (3)$$

where $\mathcal{T}_k$ is the set of candidate parses for sentence $S$ and $\tau$ is the confidence threshold. If the confidence of the reranker is less than or equal to $\tau$, the prediction of the base parser, $T_1$, is used.

Compared to mixture reranking, our method offers a more interpretable way of trading off the reranker and the base parser. With mixture reranking, it is not clear how much relative weight the reranker and the base parser get in the final decision of a parsing system if the base parser's score is not a probability (unless $\alpha \in \{0, 1\}$). Indeed, the base parser score of Qi et al. [7] is not normalized over all valid dependency trees. By using a reranker reject option, we do not rely on the score of the base parser: we tune a threshold of minimum certainty that implicitly trades off reranker and base parser, but always leads to a clear decision about whether the reranker or the base parser makes the final decision of the parsing system.
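A minimal sketch of the decision rule in Eq. (3); the names are illustrative, and the reranker is assumed to output a probability for each tree in the k-best list.

```python
def rerank_with_reject(candidates, reranker_prob, tau: float):
    """Select a tree according to the reject option in Eq. (3).

    `candidates` is the k-best list ordered by base parser score, so
    candidates[0] is the base parser's 1-best tree T_1; `reranker_prob`
    maps a tree to the reranker's probability p(T'|S).
    """
    best = max(candidates, key=reranker_prob)
    if reranker_prob(best) > tau:
        return best          # the reranker is confident enough: accept its choice
    return candidates[0]     # otherwise reject and fall back to the base parser
```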
5. Experimental Setup

5.1. Training

We ran our experiments on four low-resource languages using data from the Universal Dependencies v2.5 treebanks [13], since our chosen base parser's performance offers the most potential for improvement in the low-resource domain [14]. In particular, we decided to use Lithuanian, Belarussian, Marathi, and Tamil. The data split is indicated in Table 1.

Table 1
Dataset sizes (sentences per split)

Treebank          Train   Dev   Test
Lithuanian HSE      153    55     55
Belarussian HSE     319    65    253
Marathi UFAL        373    46     47
Tamil TTB           400    80    120

We use the base parser of Qi et al. [7] and decode the k-best list with the algorithm of Zmigrod et al. [9]. As reranking models, we train structured support vector machine (SVM) and Gaussian process (GP) classification models on each of the four languages. All models are kernelized with the kernel of Collins et al. [15], which measures the similarity of two dependency trees in terms of the number of subtrees that they have in common. The kernel thus takes structural overlap as a measure of similarity between the trees [15]. Except for the kernel, no other features are used. We refrain from using neural-network-based rerankers to avoid the construction of black-box features.

We fix the random seeds to guarantee reproducibility of our models. The implementation of the GPs (in particular the computation of the GP posterior) does not allow full reproducibility of the fitting procedure. To account for this randomness, we train the GP models over five different seeds.

We tune the inverse regularization parameter c (only in the case of the SVM), k (the number of candidate trees at prediction time), and α or τ simultaneously on the development set. In case of ties, we take the parameter combination corresponding to maximum regularization with the highest number of candidates. That corresponds to making a conservative decision while maximizing the diversity of the candidate parses, where the latter has been suggested to be useful by Do and Rehbein [5]. We choose the hyperparameter combination that achieves the highest (average) labeled attachment score (LAS) on the development set.

5.2. Evaluation

The baseline performance is generated by feeding the pre-tokenized sentences into the base parser and evaluating the predictions with the official CoNLL 2018 shared task evaluation script. This leads to considerably higher LAS scores than reported by Qi et al. [14]. We hypothesize that this is the result of the fixed tokenization and the fixed sentence split, given that the CoNLL 2018 shared task required a prediction from raw text.

All trained models are evaluated in terms of the labeled attachment score (LAS), which measures how many words were assigned the correct head with the correct dependency label. In particular, it is an F1 score, i.e., the harmonic mean of labeled precision and labeled recall [16].
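As a rough illustration, here is a bare-bones sketch of the LAS-as-F1 computation. The set-of-arcs representation and names are ours; the official shared task script additionally aligns system and gold tokens, which is where the F1 formulation matters and which we omit here.

```python
def las_f1(gold_arcs: set, pred_arcs: set) -> float:
    """Labeled attachment score as an F1 score.

    Arcs are (token_id, head_id, label) triples. With identical
    tokenization, precision equals recall and this reduces to plain
    labeled accuracy.
    """
    if not gold_arcs or not pred_arcs:
        return 0.0
    correct = len(gold_arcs & pred_arcs)
    precision = correct / len(pred_arcs)
    recall = correct / len(gold_arcs)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```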
6. Results

6.1. K-best list

Figure 1 shows the relative growth of the oracle LAS for different k. All values are relative to the base parser score, which explains why the curve is weakly increasing for every language. The growth for Lithuanian is the largest at more than 10%, whereas the oracle LAS grows by less than 5% for Belarussian. Marathi and Tamil lie in between with a growth of approximately 8%.

[Figure 1: Growth of oracle LAS and gold tree ratio.]

The gold tree ratio is also weakly increasing, given that more candidate trees can never lead to fewer gold trees in the set of candidate parses. The gold tree ratio is constantly at zero for Lithuanian. For Tamil, it is always near 5%, while for Marathi it is stable at 15%. The only gold tree ratio that shows significant improvement is for Belarussian, rising from approximately 10% to approximately 20%.

6.2. Reranking performance

The performance of the different parsing systems in terms of LAS is shown in Table 2. We can see that augmenting the base parser with an interpretable reranker does not improve parsing performance drastically.

Table 2
LAS of different reranking-based parsing systems

              Lithuanian HSE       Belarussian HSE      Marathi UFAL         Tamil TTB
              Mean   Median  Std   Mean   Median  Std   Mean   Median  Std   Mean   Median  Std
Base Parser   42.08  42.08   0     81.55  81.55   0     60.44  60.44   0     67.67  67.67   0
SVM MR        42.17  42.17   0     81.55  81.55   0     59.95  59.95   0     67.57  67.57   0
SVM Reject    42.17  42.17   0     81.53  81.53   0     59.95  59.95   0     67.67  67.67   0
GP MR         42.08  42.08   0     81.56  81.55   0.04  60.49  60.44   0.11  67.67  67.67   0
GP Reject     41.87  41.98   0.29  81.55  81.55   0     60.53  60.68   0.22  67.67  67.67   0

Table 3
Optimized hyperparameters of different reranking-based parsing systems

             Lithuanian HSE        Belarussian HSE        Marathi UFAL          Tamil TTB
SVM MR       α=1, k=6, c=199       α=0.89, k=15, c=199    α=1, k=2, c=199       α=0.95, k=15, c=199
SVM Reject   τ=0, k=6, c=199       τ=0, k=8, c=0.003      τ=0, k=2, c=199       τ=0, k=2, c=0.003
GP MR        α=0, k=15             α=0.89, k=7            α=0.95, k=15          α=0, k=15
GP Reject    τ=0.3, k=3            τ=1, k=15              τ=0.3, k=2            τ=1, k=15

Looking at the chosen hyperparameters in Table 3, we can see that both methods are highly sensitive to the reranking model. For all investigated languages, the SVM-based system with the reranker reject option trusts the reranker, as indicated by τ = 0. The corresponding system with mixture reranking reproduces this decision for Lithuanian and Marathi (i.e., α = 1), while it involves the base parser in the classification decision for Belarussian and Tamil. Still, α is also close to 1 in these cases, indicating the proximity to the reranker reject option. Interestingly, for Belarussian and Tamil the system with the reranker reject option is more regularized (i.e., smaller optimal c and k) than the system with mixture reranking (which involves the base parser). Both SVM-based systems slightly outperform the base parser in Lithuanian.

The GP-based systems trust the reranker less. For Tamil, both systems fully commit to the base parser (i.e., α = 0 and τ = 1). Note that in these situations, based on the principles by which we choose the hyperparameters (see Section 5.1), the chosen k always equals 15. Both GP-based systems take related decisions for Marathi, where both involve base parser and reranker in the final decision. This leads to a slight average outperformance of the base parser. Interestingly, the system with the reranker reject option generally chooses a smaller optimal k (also among the SVM-based systems). The decisions of the two GP-based systems are almost contrary for Lithuanian and Belarussian; neither is clearly better than the other based on the results.

7. Discussion

Based on Figure 1, we generally see that even the strong base parser of Qi et al. [7] can benefit from k-best reranking. We hypothesize that the gains in oracle LAS from additional candidates are crucial for reranking, since they were highest for Lithuanian and Marathi (the only languages where the base parser could be reranked on average).

Do and Rehbein [5] state that a prerequisite for successful neural reranking is a high gold tree ratio. In our opinion, this is not necessarily the case, given that we were able to realize small improvements with non-neural models on a language without a single gold tree in the k-best lists (i.e., Lithuanian). In Belarussian, the language with the highest gold tree ratio, on the other hand, the reranker did not bring any benefit. The reason why Belarussian is hard to rerank might be that the base parser's predictions are on average already close to the true trees. This is indicated by a steeply rising gold tree ratio but only marginal gains in oracle LAS. Additional similar trees result in a high variance of the reranker; as a result, the reranker uses strong regularization. Still, the reranker performs worse than the base parser, supposedly because the kernel function is not ideal for capturing the fine-grained differences between these highly similar candidate trees [15]. In the case of Tamil, we see a similar pattern. The higher oracle LAS growth and the less steep increase in the gold tree ratio, however, lead to less similar candidate trees. In this situation, a strongly regularized reranker can reproduce the base parser performance, whereas biasing the reranker towards the base parser seems to rather distort the predictions.

Based on the performance in Table 2 and the hyperparameter selection in Table 3, we see that both mixing methods choose similar hyperparameters and perform almost the same. Generally, the reranker reject option works better if the classifier outputs well-calibrated probabilities. Tuning the hyperparameter τ might thus be harder than tuning α, since it depends on the quality of the reranker's predicted probabilities.
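For intuition about the structural-overlap similarity discussed above, here is a deliberately simplified stand-in for the Collins et al. [15] kernel: it counts only shared depth-one subtrees (labeled head-dependent edges) with a decay weight λ, whereas the actual convolution kernel recursively credits shared subtrees of all sizes. The names and the simplification are ours, not the paper's implementation.

```python
def edge_overlap_kernel(tree_a: set, tree_b: set, lam: float = 0.7) -> float:
    """Toy structural-overlap similarity between two dependency trees.

    Trees are sets of (head, relation, dependent) edges. Each shared edge
    (a depth-one subtree) contributes lam; the full Collins et al. kernel
    would additionally count larger shared subtrees via a recursive
    dynamic program, discounted by lam per node.
    """
    return lam * len(tree_a & tree_b)

t1 = {("root", "root", "eats"), ("eats", "nsubj", "cat"), ("eats", "obj", "fish")}
t2 = {("root", "root", "eats"), ("eats", "nsubj", "dog"), ("eats", "obj", "fish")}
print(edge_overlap_kernel(t1, t2))  # 1.4: two shared edges, weighted by 0.7
```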
8. Conclusion

All in all, we observe from Table 2 that the gains of interpretable rerankers (with the kernel of Collins et al. [15]) are not enough to justify their computational cost. This is in line with the recent literature on k-best reranking [5]. Our experiments with the k-best list show a positive growth of the oracle LAS, meaning that k-best reranking offers room to improve on strong base parsers. However, the tested models cannot consistently leverage this potential.

By proposing a reranker reject option, we pave the way for fully interpretable parsing systems. Similar in performance to mixture reranking [10, 11], it allows a more fine-grained analysis of rerankers and base parsers, since every parsing decision can be traced back to one of the parsing system's components. Future work towards interpretable reranking could involve more sophisticated kernel functions that better measure the subtle differences between the candidate trees.

Acknowledgements

We thank Ran Zmigrod for insightful discussions at an early stage of the project.

References

[1] S. Kübler, R. McDonald, J. Nivre, Dependency parsing, Synthesis Lectures on Human Language Technologies 1 (2009) 1–127.
[2] S. R. Bowman, J. Gauthier, A. Rastogi, R. Gupta, C. D. Manning, C. Potts, A fast unified model for parsing and sentence understanding, in: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Berlin, Germany, 2016, pp. 1466–1477. URL: https://aclanthology.org/P16-1139. doi:10.18653/v1/P16-1139.
[3] G. Angeli, M. J. Johnson Premkumar, C. D. Manning, Leveraging linguistic structure for open domain information extraction, in: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Association for Computational Linguistics, Beijing, China, 2015, pp. 344–354. URL: https://aclanthology.org/P15-1034. doi:10.3115/v1/P15-1034.
[4] M. Straka, J. Straková, J. Hajič, Evaluating contextualized embeddings on 54 languages in POS tagging, lemmatization and dependency parsing, arXiv preprint arXiv:1908.07448 (2019). URL: https://arxiv.org/pdf/1908.07448.pdf.
[5] B.-N. Do, I. Rehbein, Neural reranking for dependency parsing: An evaluation, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online, 2020, pp. 4123–4133. URL: https://aclanthology.org/2020.acl-main.379. doi:10.18653/v1/2020.acl-main.379.
[6] K. Hall, K-best spanning tree parsing, in: Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, Association for Computational Linguistics, Prague, Czech Republic, 2007, pp. 392–399. URL: https://aclanthology.org/P07-1050.
[7] P. Qi, T. Dozat, Y. Zhang, C. D. Manning, Universal Dependency parsing from scratch, in: Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, Association for Computational Linguistics, Brussels, Belgium, 2018, pp. 160–170. URL: https://aclanthology.org/K18-2016. doi:10.18653/v1/K18-2016.
[8] R. McDonald, F. Pereira, K. Ribarov, J. Hajič, Non-projective dependency parsing using spanning tree algorithms, in: Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Vancouver, British Columbia, Canada, 2005, pp. 523–530. URL: https://aclanthology.org/H05-1066.
[9] R. Zmigrod, T. Vieira, R. Cotterell, On finding the k-best non-projective dependency trees, in: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Association for Computational Linguistics, Online, 2021, pp. 1324–1337. URL: https://aclanthology.org/2021.acl-long.106. doi:10.18653/v1/2021.acl-long.106.
[10] K. Hayashi, S. Kondo, Y. Matsumoto, Efficient stacked dependency parsing by forest reranking, Transactions of the Association for Computational Linguistics 1 (2013) 139–150. URL: https://aclanthology.org/Q13-1012. doi:10.1162/tacl_a_00216.
[11] Q. Le, T. Mikolov, Distributed representations of sentences and documents, in: E. P. Xing, T. Jebara (Eds.), Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, PMLR, Bejing, China, 2014, pp. 1188–1196. URL: http://proceedings.mlr.press/v32/le14.pdf.
[12] C. M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics), Springer-Verlag, Berlin, Heidelberg, 2006.
[13] D. Zeman, J. Nivre, M. Abrams, E. Ackermann, N. Aepli, H. Aghaei, et al., Universal Dependencies 2.5, 2019. URL: http://hdl.handle.net/11234/1-3105, LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University.
[14] P. Qi, Y. Zhang, Y. Zhang, J. Bolton, C. D. Manning, Stanza: A Python natural language processing toolkit for many human languages, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, 2020. URL: https://aclanthology.org/2020.acl-demos.14.pdf.
[15] M. Collins, N. Duffy, Parsing with a single neuron: Convolution kernels for natural language problems, Technical report UCSC-CRL-01-01, University of California at Santa Cruz, 2001. URL: http://luthuli.cs.uiuc.edu/~daf/courses/learning/Kernelpapers/collins01parsing.pdf.
[16] D. Zeman, J. Hajič, M. Popel, M. Potthast, M. Straka, F. Ginter, J. Nivre, S. Petrov, CoNLL 2018 shared task: Multilingual parsing from raw text to Universal Dependencies, in: Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, Association for Computational Linguistics, Brussels, Belgium, 2018, pp. 1–21. URL: https://aclanthology.org/K18-2001. doi:10.18653/v1/K18-2001.
[17] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, Scikit-learn: Machine learning in Python, Journal of Machine Learning Research 12 (2011) 2825–2830. URL: https://www.jmlr.org/papers/volume12/pedregosa11a/pedregosa11a.pdf.
[18] GPy, GPy: A Gaussian process framework in Python, 2012. URL: http://github.com/SheffieldML/GPy.
A. Appendix

A.1. Implementation Details

We construct the training set for each of the chosen languages by decoding up to 9 trees from the weighted graph learned by the base parser, using the algorithm of Zmigrod et al. [9]. Next, we add the corresponding true parse tree to the list. The k is viewed as a hyperparameter and is thus not fixed a priori.

We used the SVM implementation from scikit-learn [17] and the GP implementation from GPy [18]. We set the hyperparameter λ of the Collins et al. [15] kernel to 0.7, based on preliminary experiments on the Lithuanian training set.

The hyperparameters c, k, α, and τ are tuned simultaneously on the development set. The search space for the inverse regularization parameter c is (0.0001, 0.0027, 0.7356, 199.5262). The search space for k is 0 < k < 16 for SVMs and 1 < k < 16 for GPs. The latter is motivated by empirical reasons, i.e., the tendency of the GPs to always stick to the base parser unless they are forced not to. With both methods, the models can always fall back to the base parser if that is considered optimal. For the search space for α, we generated 20 evenly spaced values in the interval 0 ≤ α ≤ 1. Finally, the search space for τ is (0, 0.3, 0.7, 1). Note that none of the values is too close to 0.5, since we expect high variance in the results if values near 0.5 are used as a cutoff.
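A sketch of the tuning loop described above, for the SVM mixture reranker. The evaluation function `dev_las` is a placeholder for training a reranker with the given settings and scoring it on the development set, and the tie-breaking interprets "maximum regularization" from Section 5.1 as the smallest c (since c is the inverse regularization strength); both are our assumptions, not the paper's code.

```python
import itertools

# Search spaces from Appendix A.1 (SVM case; the GP grid omits c).
C_GRID = (0.0001, 0.0027, 0.7356, 199.5262)   # inverse regularization strength
K_GRID = range(1, 16)                          # number of candidate trees, 0 < k < 16
ALPHA_GRID = [i / 19 for i in range(20)]       # 20 evenly spaced values in [0, 1]

def tune_svm_mr(dev_las):
    """Grid search for the SVM mixture reranker.

    `dev_las(c, k, alpha)` is a placeholder that should train the reranker
    with these settings and return the development-set LAS. Ties in LAS are
    broken towards maximum regularization (smallest c) and then towards the
    highest number of candidates k, following Section 5.1.
    """
    grid = itertools.product(C_GRID, K_GRID, ALPHA_GRID)
    # Key: highest dev LAS first, then smallest c (via -c), then largest k.
    return max(grid, key=lambda p: (dev_las(*p), -p[0], p[1]))
```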