1. Introduction

Florian Schottmann

fschottmann@ethz.ch 0 1

Vincent Fortuin

fortuin@inf.ethz.ch 0 1

Edoardo Ponti

edoardo-maria.ponti@mila.quebec 1 2

Ryan Cotterell

ryan.cotterell@inf.ethz.ch 0 1 0 Department of Computer Science, ETH Zurich , Zurich , Switzerland 1 Lugano , Switzerland 2 Mila AI Centre , Quebec , Canada

Reranking the best hypothesis parse trees from an existing parser allows to take into account more information that the model has gathered during training than simply decoding the most likely dependency tree. In this paper, we first investigate whether state-of-the-art dependency parsers can still benefit from reranking in low-resource languages. As part of this analysis, we deliver new insights concerning rerankability. Second, we propose a reranker reject option, which paves the way for designing interpretable reranking-based parsing systems in the future.

1. Introduction

stream applications like natural language understanding tasks [ 2 ] or information extraction [ 3 ].

State-of-the-art dependency parsers predict the most likely dependency parse tree for a given sentence based on large neural network architectures [ 4 ]. By acting greedily with respect to the prediction, these models disregard information that was gathered during training. In particular, the best dependency tree might not have been assigned the highest probability, for example due to unusual training sentences [ 5 ]. A technique that tries to remedy this problem is -best reranking. It is based on decoding the best predictions from the base parser and reranking these using an additional machine learning model [6].

In this paper, we are investigating whether interpretable rerankers can improve current state-of-the-art dependency parsers. We challenge previous conclusions in which situations base parsers can be reranked and propose a novel approach to combining the base parser and the reranker.

2. Background

Many state-of-the-art dependency parsing systems are graph-based parsers, which are based on an edge-factored model (e.g. 4 and 7). The underlying assumption of an edge-factored model is that the score of a dependency

) ∈ . If we use the conditional probability of a dependency tree (given a particular sentence ) as a scoring function, we can write the model as exp{( , , ; )}

(1) ( |; = ) = ∑ 1

∏ ( ,, )∈ ∏

exp{( , , ; )} ′∈ full() ( ,, )∈ ′ where full() is the set of all dependency trees for sentence . is the edge scoring function and is the parameter vector.

From the model definition, we can see that any edgefactored model learns ×

edge scores for a sentence , where is the number of tokens in the sentence augmented by an artificial

token at the beginning of the sentence. Decoding the best dependency trees from such a score matrix after training is not possible naively, since the search space is exponential [8].

Recently, Zmigrod et al. [9] provided an algorithm which allows to decode the best dependency trees from an adjacency matrix induced by an edge-factored model, which we will use in this work to construct the list of hypothesis trees.

Often, building a reranking model on top of the base parser is not yet enough for state-of-the-art performance.

A popular strategy to improve a reranking model is mixparser and the reranker with a trade-of parameter that is tuned on the development set [10, 11]: = ( , ) + (1 − ) ( , ) (2) SwissText 2022: Swiss Text Analytics Conference, June 08–10, 2022, ture reranking (MR), i.e. combining the scores of the base © 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License where ∈ [ 0, 1 ] is the mixing parameter. Attribution 4.0 International (CC BY 4.0).

is the score of the reranker and is the score of the base parser given a tree . are the respective parameter vectors and is the final score of the tree .

3. K-best list

Recently, Do and Rehbein [ 5 ] emphasized that, apart from the oracle parsing accuracy [6], the quality of the -best lists of a dataset can be indicated by the gold tree ratio and the unlabeled attachment scores (UAS) standard deviation in the -best list. The gold tree ratio refers to the number of times a correct tree is in the -best list relative to the number of sentences in the respective dataset. The oracle parsing accuracy is a similar metric introduced by Hall [6], referring to the maximum score possible by always choosing the closest tree to the gold tree in the -best list.

4. Reranker reject option

In this work, we reframe mixture reranking by using a reject option for the reranker. A reject option is a concept from decision theory; it refers to the idea of rejecting a classification decision if the associated probability is lower than a certain threshold [12]. In the context of reranking, the reject option works as follows: argmax ( ′| ) ∗ = { ′∈ 1 if max ( ′| ) >

′∈ else (3) where is the set of candidate parses for sentence and is the confidence threshold. If the confidence of the reranker is less than or equal to , the prediction of the base parser 1 is used.

Compared to mixture reranking, our method ofers a more interpretable way of trading of reranker and base parser. When using mixture reranking, it is not clear how much relative weight the reranker and the base parser get in the final decision of a parsing system if the base parser’s score is not a probability (unless ∈ {0, 1}). Indeed, the base parser score of Qi et al. [7] is not normalized over all valid dependency trees. By using a reranker reject option, we do not rely on the score of the base parser: We tune a threshold of minimum certainty that implicitly trades of reranker and base parser, but always leads to a clear decision whether the reranker or the base parser takes the final decision of the parsing system.

Treebank Lithuanian HSE Belarussian HSE Marathi UFAL

Tamil TTB treebanks [13], since our chosen base parser’s performance ofers most potential for improvement in the lowresource domain [14]. In particular, we decided to use Lithuanian, Belarussian, Marathi, and Tamil. The data split is indicated in Table 1.

We use the base parser of Qi et al. [7] and decode the -best list by using the algorithm of Zmigrod et al. [9]. As reranking models, we train structured support vector machine (SVM) and Gaussian process (GP) classification models on each of the four languages. All models are kernelized with the kernel of Collins et al. [15], which measures the similarity of two dependency trees in terms of the number of subtrees that they have in common. Thus, the kernel takes structural overlap as a measure of similarity between the trees [15]. Except for the kernel, no other features are used. We refrain from using neuralnetwork-based rerankers to avoid the construction of black-box features.

We fix the random seeds to guarantee reproducibility of our models. The implementation of the GPs (in particular the computation of the GP posterior) does not allow full reproducibility of the fitting procedure. To account for this randomness, we train the GP models over five diferent seeds.

We tune the inverse regularization parameter (only in case of the SVM), (the number of candidate trees at prediction time) and or simultaneously on the development set. In case of ties, we take the parameter combination corresponding to maximum regularization with the highest number of candidates. That corresponds to making a conservative decision while maximizing the diversity of the candidate parses, where the latter has been suggested to be useful by Do and Rehbein [ 5 ]. We choose the hyperparameter combination that achieves the highest (average) labeled attachment score (LAS) on the development set.

5.2. Evaluation

The baseline performance is generated by feeding the pretokenized sentences into the base parser and evaluating 5. Experimental Setup the predictions with the oficial CoNLL 2018 shared task evaluation script. That leads to considerably higher LAS 5.1. Training scores than reported by Qi et al. [14]. We hypothesize We ran our experiments on four low-resource lan- that this is the result of the fixed tokenization and the guages using data from the Universal Dependencies v2.5 ifxed sentence split, given that the CoNLL 2018 shared

6. Results 6.1. K-best list 6.2. Reranking performance

The performance in terms of the LAS of the diferent parsing systems is indicated in Table 2. We can see that augmenting the base parser by an interpretable reranker does not improve parsing performance drastically.

Looking at the chosen hyperparameters in Table 3, we can see that both methods are highly sensitive to the reranking model.

7. Discussion

Based on Figure 1, we generally see that even the strong base parser of Qi et al. [7] can benefit from -best reranking. We hypothesize that the gains in oracle LAS of additional candidates are crucial for reranking since those were the highest for Lithuanian and Marathi (the only languages where the base parser could be reranked on average).

Do and Rehbein [ 5 ] state that a prerequisite for successful neural reranking is a high gold tree ratio. In our opinion, this is not necessarily the case, given that we were able to realize small improvements on a language without a single gold tree in the -best lists (i.e.

Lithuanian) with non-neural models. In Belarussian, the language with the highest gold tree ratio, on the other hand, the reranker did not bring any benefit. The reason why Belarussian is hard to rerank might be that the base parser’s predictions are on average already close to the true trees. This is indicated by a steeply rising gold tree ratio but only marginal gains in oracle LAS scores. Additional similar trees result in a high variance of the reranker. As a result, the reranker uses strong regularization. Still, the reranker performs worse than the base parser, supposedly because the kernel function is not ideal for capturing the fine-grained diferences between these highly similar candidate trees [15]. In the Acknowledgements case of Tamil, we see a similar pattern. The higher oracle LAS growth and the less steep increase in the gold tree ratio however lead to less similar candidate trees. In We thank Ran Zmigrod for insightful discussions at an this situation, a strongly regularized reranker can repro- early stage of the project. duce the base parser performance, whereas biasing the reranker towards the base parser seems to rather distort References the predictions.

Based on the performance in Table 2 and the hyperparameter selection in Table 3, we see that both mixing methods choose similar hyperparameters and perform almost the same. Generally, the reranker reject option works better if the classifier outputs well-calibrated probabilities. Tuning the hyperparameter might thus be harder than tuning , since it is dependent on the quality of the reranker’s predictions.

8. Conclusion

All in all, we can observe from Table 2 that the gains of interpretable rerankers (with the kernel of 15) are not enough to justify their computational cost. This is in line with the recent literature on -best reranking [ 5 ].

Our experiments with the -best list show a positive growth of the oracle LAS, meaning that -best reranking provides possibilities to improve on strong base parsers.

However, the tested models cannot consistently leverage this potential.

By proposing a reranker reject option, we pave the way for fully interpretable parsing systems. Similar in performance to mixture reranking [10, 11], it allows a more ifne-grained analysis of rerankers and base parsers, since every parsing decision can be traced back to either of the parsing system’s components. Future work towards interpretable reranking could involve more sophisticated kernel functions that are able to better measure the subtle diferences between the candidate trees. Proceedings of the 58th Annual Meeting N. Aepli, H. Aghaei, Universal dependencies 2.5, of the Association for Computational Lin- 2019. URL: http://hdl.handle.net/11234/1-3105, guistics, Association for Computational Lin- LINDAT/CLARIAH-CZ digital library at the Instiguistics, Online, 2020, pp. 4123–4133. URL: tute of Formal and Applied Linguistics (ÚFAL), Fachttps://aclanthology.org/2020.acl-main.379. ulty of Mathematics and Physics, Charles Univerdoi:10.18653/v1/2020.acl-main.379. sity. [6] K. Hall, K-best spanning tree parsing, in: Proceed- [14] P. Qi, Y. Zhang, Y. Zhang, J. Bolton, C. D. Manings of the 45th Annual Meeting of the Associa- ning, Stanza: A Python natural language processtion of Computational Linguistics, Association for ing toolkit for many human languages, in: ProceedComputational Linguistics, Prague, Czech Republic, ings of the 58th Annual Meeting of the Association 2007, pp. 392–399. URL: https://aclanthology.org/ for Computational Linguistics: System DemonstraP07-1050. tions, 2020. URL: https://aclanthology.org/2020. [7] P. Qi, T. Dozat, Y. Zhang, C. D. Manning, Uni- acl-demos.14.pdf.

versal Dependency parsing from scratch, in: Pro- [15] M. Collins, N. Dufy, F. Park, Parsing with a single ceedings of the CoNLL 2018 Shared Task: Multilin- neuron: Convolution kernels for natural language gual Parsing from Raw Text to Universal Depen- problems, Technical report UCSC-CRL-01-01, dencies, Association for Computational Linguistics, University of California at Santa Cruz (2001). Brussels, Belgium, 2018, pp. 160–170. URL: https: URL: http://luthuli.cs.uiuc.edu/~daf/courses/ //aclanthology.org/K18-2016. doi:10.18653/v1/ learning/Kernelpapers/collins01parsing.pdf.

K18-2016. [16] D. Zeman, J. Hajič, M. Popel, M. Potthast, M. Straka, [8] R. McDonald, F. Pereira, K. Ribarov, J. Hajič, Non- F. Ginter, J. Nivre, S. Petrov, CoNLL 2018 shared projective dependency parsing using spanning tree task: Multilingual parsing from raw text to Unialgorithms, in: Proceedings of Human Language versal Dependencies, in: Proceedings of the Technology Conference and Conference on Empiri- CoNLL 2018 Shared Task: Multilingual Parsing cal Methods in Natural Language Processing, Asso- from Raw Text to Universal Dependencies, Associciation for Computational Linguistics, Vancouver, ation for Computational Linguistics, Brussels, BelBritish Columbia, Canada, 2005, pp. 523–530. URL: gium, 2018, pp. 1–21. URL: https://aclanthology. https://aclanthology.org/H05-1066. org/K18-2001. doi:10.18653/v1/K18-2001. [9] R. Zmigrod, T. Vieira, R. Cotterell, On find- [17] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, ing the k-best non-projective dependency trees, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, in: Proceedings of the 59th Annual Meet- R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, ing of the Association for Computational Lin- D. Cournapeau, M. Brucher, M. Perrot, E. Duchguistics and the 11th International Joint Con- esnay, Scikit-learn: Machine learning in Python, ference on Natural Language Processing (Vol- Journal of Machine Learning Research 12 (2011) ume 1: Long Papers), Association for Computa- 2825–2830. URL: https://www.jmlr.org/papers/ tional Linguistics, Online, 2021, pp. 1324–1337. volume12/pedregosa11a/pedregosa11a.pdf. URL: https://aclanthology.org/2021.acl-long.106. [18] GPy, GPy: A gaussian process framework in python, doi:10.18653/v1/2021.acl-long.106. 2012. URL: http://github.com/SheffieldML/GPy. [10] K. Hayashi, S. Kondo, Y. Matsumoto, Efifcient stacked dependency parsing by forest reranking, Transactions of the Association for Computational Linguistics 1 (2013) 139–150.

URL: https://aclanthology.org/Q13-1012. doi:10.

1162/tacl_a_00216. [11] Q. Le, T. Mikolov, Distributed representations of sentences and documents, in: E. P. Xing, T. Jebara (Eds.), Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, PMLR, Bejing, China, 2014, pp. 1188–1196. URL: http: //proceedings.mlr.press/v32/le14.pdf. [12] C. M. Bishop, Pattern Recognition and Machine

Learning (Information Science and Statistics),

Springer-Verlag, Berlin, Heidelberg, 2006. [13] D. Zeman, J. Nivre, M. Abrams, E. Ackermann,

A.1. Implementation Details

We construct the training set for each of the chosen languages by decoding up to 9 trees from the weighted graph learned by the base parser with the algorithm of Zmigrod et al. [9]. Next, we add the corresponding true parse tree to the list. The is viewed as a hyperparameter and is thus not fixed a priori.

We used the implementation of the SVMs from sklearn [17] and the GP implementation of GPy [18]. We set the hyperparameter of the Collins et al. [15] kernel to 0.7 based on preliminary experiments on the Lithuanian training set.

The hyperparameters , , and are tuned simultaneously on the development set. The search space for the inverse regularization parameter c is (0.0001, 0.0027, 0.7356, 199.5262).

The search space for is 0 < < 16 for SVMs and 1 < < 16 for GPs. The latter is motivated by empirical reasons, i.e. the tendency of the GPs to always stick to the base parser unless they are forced not to. With both methods, the models can always fall back to the base parser if that is considered optimal.

For the search space for , we generated 20 evenly spaced values in the interval 0 ≤ ≤ 1 .

Finally, the search space for is (0, 0.3, 0.7, 1). Note that none of the values is too close to 0.5, since we expect high variance in the results if values near 0.5 are used as a cutof.

[1]

Kübler ,

McDonald ,

Nivre , Dependency parsing, Synthesis lectures on human language technologies 1 ( 2009 ) 1 - 127 .

[2]

S. R.

Bowman , J. Gauthier,

Rastogi ,

Gupta ,

C. D.

Manning ,

Potts , A fast unified model for parsing and sentence understanding, in: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1 : Long

Papers)

, Association for Computational Linguistics , Berlin, Germany, 2016 , pp. 1466 - 1477 . URL: https://aclanthology.org/P16-1139. doi: 10 . 18653/v1/ P16 -1139.

[3]

Angeli ,

M. J. Johnson

Premkumar ,

C. D.

Manning , Leveraging linguistic structure for open domain information extraction , in: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1 : Long

Papers)

, Association for Computational Linguistics , Beijing, China, 2015 , pp. 344 - 354 . URL: https://aclanthology.org/P15-1034. doi: 10 . 3115/v1/ P15 -1034.

[4]

Straka ,

Straková ,

Hajič , Evaluating contextualized embeddings on 54 languages in pos tagging, lemmatization and dependency parsing , arXiv preprint arXiv: 1908 . 07448 ( 2019 ). URL: https: //arxiv.org/pdf/ 1908 .07448.pdf.

[5]

B.-N.

Do , I. Rehbein , Neural reranking for dependency parsing: An evaluation , in: