On Interpretable Reranking-Based Dependency Parsing Systems

Florian Schottmann¹, Vincent Fortuin¹, Edoardo Ponti² and Ryan Cotterell¹
¹ Department of Computer Science, ETH Zurich, Zurich, Switzerland
² Mila AI Centre, Quebec, Canada

SwissText 2022: Swiss Text Analytics Conference, June 08–10, 2022, Lugano, Switzerland
fschottmann@ethz.ch (F. Schottmann); fortuin@inf.ethz.ch (V. Fortuin); edoardo-maria.ponti@mila.quebec (E. Ponti); ryan.cotterell@inf.ethz.ch (R. Cotterell)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

Reranking the k best hypothesis parse trees from an existing parser makes it possible to exploit more of the information the model has gathered during training than simply decoding the single most likely dependency tree. In this paper, we first investigate whether state-of-the-art dependency parsers can still benefit from reranking in low-resource languages. As part of this analysis, we deliver new insights concerning rerankability. Second, we propose a reranker reject option, which paves the way for designing interpretable reranking-based parsing systems in the future.

1. Introduction

Dependency parse trees model the structural relationships in sentences [1]. They are relevant for many downstream applications such as natural language understanding [2] and information extraction [3].

State-of-the-art dependency parsers predict the most likely dependency parse tree for a given sentence using large neural network architectures [4]. By acting greedily with respect to this prediction, these models disregard information that was gathered during training. In particular, the best dependency tree might not have been assigned the highest probability, for example due to unusual training sentences [5]. A technique that tries to remedy this problem is k-best reranking: the k best predictions are decoded from the base parser and then reranked by an additional machine learning model [6].

In this paper, we investigate whether interpretable rerankers can improve current state-of-the-art dependency parsers. We challenge previous conclusions about the situations in which base parsers can be reranked, and we propose a novel approach to combining the base parser and the reranker.

2. Background

Many state-of-the-art dependency parsing systems are graph-based parsers, which rely on an edge-factored model (e.g., [4, 7]). The underlying assumption of an edge-factored model is that the score of a dependency tree can be factored across the edges of the graph [1]. In this work, we define the set of edges of a dependency tree $T$ as $E_T$. An edge consists of a relation label $r \in R$ (where $R$ is the set of all possible relation labels) and two tokens $w_i$ and $w_j$, i.e., $(w_i, r, w_j) \in E_T$. If we use the conditional probability of a dependency tree (given a particular sentence $S$) as a scoring function, we can write the model as

$$p(T \mid S; \theta) = \frac{1}{Z} \prod_{(w_i, r, w_j) \in E_T} \exp\{s(w_i, r, w_j; \theta)\} \qquad (1)$$

$$Z = \sum_{T' \in \mathcal{T}_{\text{full}}(S)} \; \prod_{(w_i, r, w_j) \in E_{T'}} \exp\{s(w_i, r, w_j; \theta)\}$$

where $\mathcal{T}_{\text{full}}(S)$ is the set of all dependency trees for sentence $S$, $s$ is the edge scoring function, and $\theta$ is the parameter vector.

From the model definition, we can see that any edge-factored model learns $n \times n$ edge scores for a sentence $S$, where $n$ is the number of tokens in the sentence augmented by an artificial root token at the beginning of the sentence. Decoding the $k$ best dependency trees from such a score matrix after training is not possible naively, since the search space is exponential [8]. Recently, Zmigrod et al. [9] provided an algorithm for decoding the $k$ best dependency trees from an adjacency matrix induced by an edge-factored model, which we use in this work to construct the list of hypothesis trees.
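To make the factorization in Eq. (1) concrete, here is a minimal sketch of scoring a single candidate tree under an edge-factored model. The score matrix, the unlabeled-edge simplification, and all names are illustrative rather than the paper's implementation; the normalizer Z is omitted, since computing it requires summing over all trees (e.g., via the Matrix-Tree Theorem) rather than a simple loop.

```python
import numpy as np

def tree_log_score(scores: np.ndarray, edges: list[tuple[int, int]]) -> float:
    """Unnormalized log-score of a dependency tree under an edge-factored model.

    `scores[h, d]` stands in for s(w_h, w_d; theta) for head h and dependent d,
    with index 0 reserved for the artificial root token. Relation labels are
    omitted for brevity; a labeled model would index scores[h, r, d].
    """
    # The tree score factors over edges: log p(T|S) = sum of edge scores - log Z.
    return sum(scores[h, d] for h, d in edges)

# Toy example: a 3-token sentence (root + 3 tokens gives a 4x4 score matrix).
rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 4))
tree = [(0, 2), (2, 1), (2, 3)]  # root -> w2, w2 -> w1, w2 -> w3
print(tree_log_score(scores, tree))
```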
Often, building a reranking model on top of the base parser is not yet enough for state-of-the-art performance. A popular strategy for improving a reranking model is mixture reranking (MR), i.e., combining the scores of the base parser and the reranker with a trade-off parameter that is tuned on the development set [10, 11]:

$$s_f = \alpha \, s_{RR}(T, \theta_{RR}) + (1 - \alpha) \, s_B(T, \theta_B) \qquad (2)$$

where $\alpha \in [0, 1]$ is the mixing parameter, $s_{RR}$ is the score of the reranker, and $s_B$ is the score of the base parser given a tree $T$. The $\theta$ are the respective parameter vectors and $s_f$ is the final score of the tree $T$.
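A minimal sketch of mixture reranking over a k-best list follows; the function and variable names are illustrative, not taken from the paper's code.

```python
def mixture_rerank(candidates, reranker_score, base_score, alpha: float):
    """Return the candidate tree maximizing the mixture score of Eq. (2).

    `candidates` is the k-best list from the base parser; `reranker_score`
    and `base_score` map a tree to s_RR and s_B, respectively; `alpha` is
    the mixing parameter tuned on the development set.
    """
    def final_score(tree):
        return alpha * reranker_score(tree) + (1.0 - alpha) * base_score(tree)

    return max(candidates, key=final_score)
```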
3. K-best list

Recently, Do and Rehbein [5] emphasized that, apart from the oracle parsing accuracy [6], the quality of the k-best lists of a dataset can be indicated by the gold tree ratio and the standard deviation of the unlabeled attachment scores (UAS) within the k-best list. The gold tree ratio refers to the number of times a correct tree is in the k-best list relative to the number of sentences in the respective dataset. The oracle parsing accuracy is a similar metric introduced by Hall [6], referring to the maximum score achievable by always choosing the tree closest to the gold tree in the k-best list.

4. Reranker reject option

In this work, we reframe mixture reranking by using a reject option for the reranker. A reject option is a concept from decision theory; it refers to the idea of rejecting a classification decision if the associated probability is lower than a certain threshold $\tau$ [12]. In the context of reranking, the reject option works as follows:

$$T^* = \begin{cases} \operatorname{argmax}_{T' \in \mathcal{T}_k} \, p(T' \mid S) & \text{if } \max_{T' \in \mathcal{T}_k} p(T' \mid S) > \tau \\ T_1 & \text{else} \end{cases} \qquad (3)$$

where $\mathcal{T}_k$ is the set of candidate parses for sentence $S$ and $\tau$ is the confidence threshold. If the confidence of the reranker is less than or equal to $\tau$, the prediction of the base parser, $T_1$, is used.

Compared to mixture reranking, our method offers a more interpretable way of trading off the reranker and the base parser. With mixture reranking, it is not clear how much relative weight the reranker and the base parser get in the final decision of a parsing system if the base parser's score is not a probability (unless $\alpha \in \{0, 1\}$). Indeed, the base parser score of Qi et al. [7] is not normalized over all valid dependency trees. By using a reranker reject option, we do not rely on the score of the base parser: we tune a threshold of minimum certainty that implicitly trades off reranker and base parser, but always leads to a clear decision about whether the reranker or the base parser makes the final decision of the parsing system.
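A minimal sketch of the decision rule in Eq. (3); the names are illustrative, and the reranker is assumed to output a probability for each tree in the k-best list.

```python
def rerank_with_reject(candidates, reranker_prob, tau: float):
    """Select a tree according to the reject option in Eq. (3).

    `candidates` is the k-best list ordered by base parser score, so
    candidates[0] is the base parser's 1-best tree T_1; `reranker_prob`
    maps a tree to the reranker's probability p(T'|S).
    """
    best = max(candidates, key=reranker_prob)
    if reranker_prob(best) > tau:
        return best          # the reranker is confident enough: accept its choice
    return candidates[0]     # otherwise reject and fall back to the base parser
```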
5. Experimental Setup

5.1. Training

We ran our experiments on four low-resource languages using data from the Universal Dependencies v2.5 treebanks [13], since our chosen base parser's performance offers the most potential for improvement in the low-resource domain [14]. In particular, we decided to use Lithuanian, Belarussian, Marathi, and Tamil. The data split is indicated in Table 1.

Table 1
Dataset sizes (sentences per split)

Treebank          Train   Dev   Test
Lithuanian HSE      153    55     55
Belarussian HSE     319    65    253
Marathi UFAL        373    46     47
Tamil TTB           400    80    120

We use the base parser of Qi et al. [7] and decode the k-best list with the algorithm of Zmigrod et al. [9]. As reranking models, we train structured support vector machine (SVM) and Gaussian process (GP) classification models on each of the four languages. All models are kernelized with the kernel of Collins et al. [15], which measures the similarity of two dependency trees in terms of the number of subtrees that they have in common. The kernel thus takes structural overlap as a measure of similarity between the trees [15]. Except for the kernel, no other features are used. We refrain from using neural-network-based rerankers to avoid the construction of black-box features.

We fix the random seeds to guarantee reproducibility of our models. The implementation of the GPs (in particular the computation of the GP posterior) does not allow full reproducibility of the fitting procedure. To account for this randomness, we train the GP models over five different seeds.

We tune the inverse regularization parameter c (only in the case of the SVM), k (the number of candidate trees at prediction time), and α or τ simultaneously on the development set. In case of ties, we take the parameter combination corresponding to maximum regularization with the highest number of candidates. That corresponds to making a conservative decision while maximizing the diversity of the candidate parses, where the latter has been suggested to be useful by Do and Rehbein [5]. We choose the hyperparameter combination that achieves the highest (average) labeled attachment score (LAS) on the development set.

5.2. Evaluation

The baseline performance is generated by feeding the pre-tokenized sentences into the base parser and evaluating the predictions with the official CoNLL 2018 shared task evaluation script. This leads to considerably higher LAS scores than reported by Qi et al. [14]. We hypothesize that this is the result of the fixed tokenization and the fixed sentence split, given that the CoNLL 2018 shared task required a prediction from raw text.

All trained models are evaluated in terms of the labeled attachment score (LAS), which measures how many words were assigned the correct head with the correct dependency label. In particular, it is an F1 score, i.e., the harmonic mean of labeled precision and labeled recall [16].
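As a rough illustration, here is a bare-bones sketch of the LAS-as-F1 computation. The set-of-arcs representation and names are ours; the official shared task script additionally aligns system and gold tokens, which is where the F1 formulation matters and which we omit here.

```python
def las_f1(gold_arcs: set, pred_arcs: set) -> float:
    """Labeled attachment score as an F1 score.

    Arcs are (token_id, head_id, label) triples. With identical
    tokenization, precision equals recall and this reduces to plain
    labeled accuracy.
    """
    if not gold_arcs or not pred_arcs:
        return 0.0
    correct = len(gold_arcs & pred_arcs)
    precision = correct / len(pred_arcs)
    recall = correct / len(gold_arcs)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```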
6. Results

6.1. K-best list

Figure 1 shows the relative growth of the oracle LAS for different k. All values are relative to the base parser score, which explains why the curve is weakly increasing for every language. The growth for Lithuanian is the largest at more than 10%, whereas the oracle LAS grows by less than 5% for Belarussian. Marathi and Tamil lie in between with a growth of approximately 8%.

[Figure 1: Growth of oracle LAS and gold tree ratio.]

The gold tree ratio is also weakly increasing, given that more candidate trees can never lead to fewer gold trees in the set of candidate parses. The gold tree ratio is constantly at zero for Lithuanian. For Tamil, it is always near 5%, while for Marathi it is stable at 15%. The only gold tree ratio that shows significant improvement is for Belarussian, rising from approximately 10% to approximately 20%.

6.2. Reranking performance

The performance of the different parsing systems in terms of LAS is shown in Table 2. We can see that augmenting the base parser with an interpretable reranker does not improve parsing performance drastically.

Table 2
LAS of different reranking-based parsing systems

              Lithuanian HSE       Belarussian HSE      Marathi UFAL         Tamil TTB
              Mean   Median  Std   Mean   Median  Std   Mean   Median  Std   Mean   Median  Std
Base Parser   42.08  42.08   0     81.55  81.55   0     60.44  60.44   0     67.67  67.67   0
SVM MR        42.17  42.17   0     81.55  81.55   0     59.95  59.95   0     67.57  67.57   0
SVM Reject    42.17  42.17   0     81.53  81.53   0     59.95  59.95   0     67.67  67.67   0
GP MR         42.08  42.08   0     81.56  81.55   0.04  60.49  60.44   0.11  67.67  67.67   0
GP Reject     41.87  41.98   0.29  81.55  81.55   0     60.53  60.68   0.22  67.67  67.67   0

Table 3
Optimized hyperparameters of different reranking-based parsing systems

             Lithuanian HSE        Belarussian HSE        Marathi UFAL          Tamil TTB
SVM MR       α=1, k=6, c=199       α=0.89, k=15, c=199    α=1, k=2, c=199       α=0.95, k=15, c=199
SVM Reject   τ=0, k=6, c=199       τ=0, k=8, c=0.003      τ=0, k=2, c=199       τ=0, k=2, c=0.003
GP MR        α=0, k=15             α=0.89, k=7            α=0.95, k=15          α=0, k=15
GP Reject    τ=0.3, k=3            τ=1, k=15              τ=0.3, k=2            τ=1, k=15

Looking at the chosen hyperparameters in Table 3, we can see that both methods are highly sensitive to the reranking model. For all investigated languages, the SVM-based system with the reranker reject option trusts the reranker, as indicated by τ = 0. The corresponding system with mixture reranking reproduces this decision for Lithuanian and Marathi (i.e., α = 1), while it involves the base parser in the classification decision for Belarussian and Tamil. Still, α is also close to 1 in these cases, indicating the proximity to the reranker reject option. Interestingly, for Belarussian and Tamil the system with the reranker reject option is more regularized (i.e., smaller optimal c and k) than the system with mixture reranking (which involves the base parser). Both SVM-based systems slightly outperform the base parser in Lithuanian.

The GP-based systems trust the reranker less. For Tamil, both systems fully commit to the base parser (i.e., α = 0 and τ = 1). Note that in these situations, based on the principles by which we choose the hyperparameters (see Section 5.1), the chosen k always equals 15. Both GP-based systems take related decisions for Marathi, where both involve base parser and reranker in the final decision. This leads to a slight average outperformance of the base parser. Interestingly, the system with the reranker reject option generally chooses a smaller optimal k (also among the SVM-based systems). The decisions of the two GP-based systems are almost contrary for Lithuanian and Belarussian; neither is clearly better than the other based on the results.

7. Discussion

Based on Figure 1, we generally see that even the strong base parser of Qi et al. [7] can benefit from k-best reranking. We hypothesize that the gains in oracle LAS from additional candidates are crucial for reranking, since they were highest for Lithuanian and Marathi (the only languages where the base parser could be reranked on average).

Do and Rehbein [5] state that a prerequisite for successful neural reranking is a high gold tree ratio. In our opinion, this is not necessarily the case, given that we were able to realize small improvements with non-neural models on a language without a single gold tree in the k-best lists (i.e., Lithuanian). In Belarussian, the language with the highest gold tree ratio, on the other hand, the reranker did not bring any benefit. The reason why Belarussian is hard to rerank might be that the base parser's predictions are on average already close to the true trees. This is indicated by a steeply rising gold tree ratio but only marginal gains in oracle LAS. Additional similar trees result in a high variance of the reranker; as a result, the reranker uses strong regularization. Still, the reranker performs worse than the base parser, supposedly because the kernel function is not ideal for capturing the fine-grained differences between these highly similar candidate trees [15]. In the case of Tamil, we see a similar pattern. The higher oracle LAS growth and the less steep increase in the gold tree ratio, however, lead to less similar candidate trees. In this situation, a strongly regularized reranker can reproduce the base parser performance, whereas biasing the reranker towards the base parser seems to rather distort the predictions.

Based on the performance in Table 2 and the hyperparameter selection in Table 3, we see that both mixing methods choose similar hyperparameters and perform almost the same. Generally, the reranker reject option works better if the classifier outputs well-calibrated probabilities. Tuning the hyperparameter τ might thus be harder than tuning α, since it depends on the quality of the reranker's predicted probabilities.
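For intuition about the structural-overlap similarity discussed above, here is a deliberately simplified stand-in for the Collins et al. [15] kernel: it counts only shared depth-one subtrees (labeled head-dependent edges) with a decay weight λ, whereas the actual convolution kernel recursively credits shared subtrees of all sizes. The names and the simplification are ours, not the paper's implementation.

```python
def edge_overlap_kernel(tree_a: set, tree_b: set, lam: float = 0.7) -> float:
    """Toy structural-overlap similarity between two dependency trees.

    Trees are sets of (head, relation, dependent) edges. Each shared edge
    (a depth-one subtree) contributes lam; the full Collins et al. kernel
    would additionally count larger shared subtrees via a recursive
    dynamic program, discounted by lam per node.
    """
    return lam * len(tree_a & tree_b)

t1 = {("root", "root", "eats"), ("eats", "nsubj", "cat"), ("eats", "obj", "fish")}
t2 = {("root", "root", "eats"), ("eats", "nsubj", "dog"), ("eats", "obj", "fish")}
print(edge_overlap_kernel(t1, t2))  # 1.4: two shared edges, weighted by 0.7
```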
8. Conclusion

All in all, we observe from Table 2 that the gains of interpretable rerankers (with the kernel of Collins et al. [15]) are not enough to justify their computational cost. This is in line with the recent literature on k-best reranking [5]. Our experiments with the k-best list show a positive growth of the oracle LAS, meaning that k-best reranking offers room to improve on strong base parsers. However, the tested models cannot consistently leverage this potential.

By proposing a reranker reject option, we pave the way for fully interpretable parsing systems. Similar in performance to mixture reranking [10, 11], it allows a more fine-grained analysis of rerankers and base parsers, since every parsing decision can be traced back to one of the parsing system's components. Future work towards interpretable reranking could involve more sophisticated kernel functions that better measure the subtle differences between the candidate trees.

Acknowledgements

We thank Ran Zmigrod for insightful discussions at an early stage of the project.

References

[1] S. Kübler, R. McDonald, J. Nivre, Dependency parsing, Synthesis Lectures on Human Language Technologies 1 (2009) 1–127.
[2] S. R. Bowman, J. Gauthier, A. Rastogi, R. Gupta, C. D. Manning, C. Potts, A fast unified model for parsing and sentence understanding, in: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Berlin, Germany, 2016, pp. 1466–1477. URL: https://aclanthology.org/P16-1139. doi:10.18653/v1/P16-1139.
[3] G. Angeli, M. J. Johnson Premkumar, C. D. Manning, Leveraging linguistic structure for open domain information extraction, in: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Association for Computational Linguistics, Beijing, China, 2015, pp. 344–354. URL: https://aclanthology.org/P15-1034. doi:10.3115/v1/P15-1034.
[4] M. Straka, J. Straková, J. Hajič, Evaluating contextualized embeddings on 54 languages in POS tagging, lemmatization and dependency parsing, arXiv preprint arXiv:1908.07448 (2019). URL: https://arxiv.org/pdf/1908.07448.pdf.
[5] B.-N. Do, I. Rehbein, Neural reranking for dependency parsing: An evaluation, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online, 2020, pp. 4123–4133. URL: https://aclanthology.org/2020.acl-main.379. doi:10.18653/v1/2020.acl-main.379.
[6] K. Hall, K-best spanning tree parsing, in: Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, Association for Computational Linguistics, Prague, Czech Republic, 2007, pp. 392–399. URL: https://aclanthology.org/P07-1050.
[7] P. Qi, T. Dozat, Y. Zhang, C. D. Manning, Universal Dependency parsing from scratch, in: Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, Association for Computational Linguistics, Brussels, Belgium, 2018, pp. 160–170. URL: https://aclanthology.org/K18-2016. doi:10.18653/v1/K18-2016.
[8] R. McDonald, F. Pereira, K. Ribarov, J. Hajič, Non-projective dependency parsing using spanning tree algorithms, in: Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Vancouver, British Columbia, Canada, 2005, pp. 523–530. URL: https://aclanthology.org/H05-1066.
[9] R. Zmigrod, T. Vieira, R. Cotterell, On finding the k-best non-projective dependency trees, in: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Association for Computational Linguistics, Online, 2021, pp. 1324–1337. URL: https://aclanthology.org/2021.acl-long.106. doi:10.18653/v1/2021.acl-long.106.
[10] K. Hayashi, S. Kondo, Y. Matsumoto, Efficient stacked dependency parsing by forest reranking, Transactions of the Association for Computational Linguistics 1 (2013) 139–150. URL: https://aclanthology.org/Q13-1012. doi:10.1162/tacl_a_00216.
[11] Q. Le, T. Mikolov, Distributed representations of sentences and documents, in: E. P. Xing, T. Jebara (Eds.), Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, PMLR, Bejing, China, 2014, pp. 1188–1196. URL: http://proceedings.mlr.press/v32/le14.pdf.
[12] C. M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics), Springer-Verlag, Berlin, Heidelberg, 2006.
[13] D. Zeman, J. Nivre, M. Abrams, E. Ackermann, N. Aepli, H. Aghaei, et al., Universal Dependencies 2.5, 2019. URL: http://hdl.handle.net/11234/1-3105, LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University.
[14] P. Qi, Y. Zhang, Y. Zhang, J. Bolton, C. D. Manning, Stanza: A Python natural language processing toolkit for many human languages, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, 2020. URL: https://aclanthology.org/2020.acl-demos.14.pdf.
[15] M. Collins, N. Duffy, Parsing with a single neuron: Convolution kernels for natural language problems, Technical report UCSC-CRL-01-01, University of California at Santa Cruz, 2001. URL: http://luthuli.cs.uiuc.edu/~daf/courses/learning/Kernelpapers/collins01parsing.pdf.
[16] D. Zeman, J. Hajič, M. Popel, M. Potthast, M. Straka, F. Ginter, J. Nivre, S. Petrov, CoNLL 2018 shared task: Multilingual parsing from raw text to Universal Dependencies, in: Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, Association for Computational Linguistics, Brussels, Belgium, 2018, pp. 1–21. URL: https://aclanthology.org/K18-2001. doi:10.18653/v1/K18-2001.
[17] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, Scikit-learn: Machine learning in Python, Journal of Machine Learning Research 12 (2011) 2825–2830. URL: https://www.jmlr.org/papers/volume12/pedregosa11a/pedregosa11a.pdf.
[18] GPy, GPy: A Gaussian process framework in Python, 2012. URL: http://github.com/SheffieldML/GPy.
A. Appendix

A.1. Implementation Details

We construct the training set for each of the chosen languages by decoding up to 9 trees from the weighted graph learned by the base parser, using the algorithm of Zmigrod et al. [9]. Next, we add the corresponding true parse tree to the list. The k is viewed as a hyperparameter and is thus not fixed a priori.

We used the SVM implementation from scikit-learn [17] and the GP implementation from GPy [18]. We set the hyperparameter λ of the Collins et al. [15] kernel to 0.7, based on preliminary experiments on the Lithuanian training set.

The hyperparameters c, k, α, and τ are tuned simultaneously on the development set. The search space for the inverse regularization parameter c is (0.0001, 0.0027, 0.7356, 199.5262). The search space for k is 0 < k < 16 for SVMs and 1 < k < 16 for GPs. The latter is motivated by empirical reasons, i.e., the tendency of the GPs to always stick to the base parser unless they are forced not to. With both methods, the models can always fall back to the base parser if that is considered optimal. For the search space for α, we generated 20 evenly spaced values in the interval 0 ≤ α ≤ 1. Finally, the search space for τ is (0, 0.3, 0.7, 1). Note that none of the values is too close to 0.5, since we expect high variance in the results if values near 0.5 are used as a cutoff.
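A sketch of the tuning loop described above, for the SVM mixture reranker. The evaluation function `dev_las` is a placeholder for training a reranker with the given settings and scoring it on the development set, and the tie-breaking interprets "maximum regularization" from Section 5.1 as the smallest c (since c is the inverse regularization strength); both are our assumptions, not the paper's code.

```python
import itertools

# Search spaces from Appendix A.1 (SVM case; the GP grid omits c).
C_GRID = (0.0001, 0.0027, 0.7356, 199.5262)   # inverse regularization strength
K_GRID = range(1, 16)                          # number of candidate trees, 0 < k < 16
ALPHA_GRID = [i / 19 for i in range(20)]       # 20 evenly spaced values in [0, 1]

def tune_svm_mr(dev_las):
    """Grid search for the SVM mixture reranker.

    `dev_las(c, k, alpha)` is a placeholder that should train the reranker
    with these settings and return the development-set LAS. Ties in LAS are
    broken towards maximum regularization (smallest c) and then towards the
    highest number of candidates k, following Section 5.1.
    """
    grid = itertools.product(C_GRID, K_GRID, ALPHA_GRID)
    # Key: highest dev LAS first, then smallest c (via -c), then largest k.
    return max(grid, key=lambda p: (dev_las(*p), -p[0], p[1]))
```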