<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Florian Schottmann</string-name>
          <email>fschottmann@ethz.ch</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vincent Fortuin</string-name>
          <email>fortuin@inf.ethz.ch</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Edoardo Ponti</string-name>
          <email>edoardo-maria.ponti@mila.quebec</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ryan Cotterell</string-name>
          <email>ryan.cotterell@inf.ethz.ch</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, ETH Zurich</institution>
          ,
          <addr-line>Zurich</addr-line>
          ,
          <country country="CH">Switzerland</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Lugano</institution>
          ,
          <country country="CH">Switzerland</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Mila AI Centre</institution>
          ,
          <addr-line>Quebec</addr-line>
          ,
          <country country="CA">Canada</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Reranking the  best hypothesis parse trees from an existing parser allows to take into account more information that the model has gathered during training than simply decoding the most likely dependency tree. In this paper, we first investigate whether state-of-the-art dependency parsers can still benefit from reranking in low-resource languages. As part of this analysis, we deliver new insights concerning rerankability. Second, we propose a reranker reject option, which paves the way for designing interpretable reranking-based parsing systems in the future.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        stream applications like natural language understanding
tasks [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] or information extraction [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>
        State-of-the-art dependency parsers predict the most
likely dependency parse tree for a given sentence based
on large neural network architectures [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. By acting
greedily with respect to the prediction, these models
disregard information that was gathered during training.
In particular, the best dependency tree might not have
been assigned the highest probability, for example due to
unusual training sentences [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. A technique that tries to
remedy this problem is  -best reranking. It is based on
decoding the  best predictions from the base parser and
reranking these using an additional machine learning
model [6].
      </p>
      <p>In this paper, we are investigating whether
interpretable rerankers can improve current state-of-the-art
dependency parsers. We challenge previous conclusions
in which situations base parsers can be reranked and
propose a novel approach to combining the base parser
and the reranker.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background</title>
      <p>Many state-of-the-art dependency parsing systems are
graph-based parsers, which are based on an edge-factored
model (e.g. 4 and 7). The underlying assumption of an
edge-factored model is that the score of a dependency</p>
      <p>) ∈   . If we use the
conditional probability of a dependency tree (given a
particular sentence  ) as a scoring function, we can write
the model as
exp{(  ,  ,   ;  )}</p>
      <p>(1)
( |;
 =
 ) =
∑
1</p>
      <p>∏
 (  ,,  )∈ 
∏</p>
      <p>exp{(  ,  ,   ;  )}
 ′∈ full() (  ,,  )∈  ′
where  full() is the set of all dependency trees for
sentence  .  is the edge scoring function and  is the
parameter vector.</p>
      <p>From the model definition, we can see that any
edgefactored model learns  ×</p>
      <p>edge scores for a sentence
 , where  is the number of tokens in the sentence
augmented by an artificial</p>
      <p>token at the beginning of the
sentence. Decoding the  best dependency trees from
such a score matrix after training is not possible naively,
since the search space is exponential [8].</p>
      <p>Recently, Zmigrod et al. [9] provided an algorithm
which allows to decode the  best dependency trees from
an adjacency matrix induced by an edge-factored model,
which we will use in this work to construct the list of
hypothesis trees.</p>
      <p>Often, building a reranking model on top of the base
parser is not yet enough for state-of-the-art performance.</p>
      <p>
        A popular strategy to improve a reranking model is
mixparser and the reranker with a trade-of parameter that
is tuned on the development set [10, 11]:
  =   
( ,   ) + (1 −  )   ( ,   )
(2)
SwissText 2022: Swiss Text Analytics Conference, June 08–10, 2022, ture reranking (MR), i.e. combining the scores of the base
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License where  ∈ [
        <xref ref-type="bibr" rid="ref1">0, 1</xref>
        ] is the mixing parameter.  
Attribution 4.0 International (CC BY 4.0).
      </p>
      <p>is the score
of the reranker and   is the score of the base parser given
a tree  .  are the respective parameter vectors and   is
the final score of the tree  .</p>
    </sec>
    <sec id="sec-3">
      <title>3. K-best list</title>
      <p>
        Recently, Do and Rehbein [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] emphasized that, apart
from the oracle parsing accuracy [6], the quality of the
 -best lists of a dataset can be indicated by the gold tree
ratio and the unlabeled attachment scores (UAS) standard
deviation in the  -best list. The gold tree ratio refers to
the number of times a correct tree is in the  -best list
relative to the number of sentences in the respective
dataset. The oracle parsing accuracy is a similar metric
introduced by Hall [6], referring to the maximum score
possible by always choosing the closest tree to the gold
tree in the  -best list.
      </p>
    </sec>
    <sec id="sec-4">
      <title>4. Reranker reject option</title>
      <p>In this work, we reframe mixture reranking by using a
reject option for the reranker. A reject option is a concept
from decision theory; it refers to the idea of rejecting
a classification decision if the associated probability is
lower than a certain threshold  [12]. In the context of
reranking, the reject option works as follows:
argmax  ( ′| )
 ∗ = {  ′∈ 
 1
if max  ( ′| ) &gt;</p>
      <p>′∈ 
else
(3)
where   is the set of candidate parses for sentence
 and  is the confidence threshold. If the confidence of
the reranker is less than or equal to  , the prediction of
the base parser  1 is used.</p>
      <p>Compared to mixture reranking, our method ofers
a more interpretable way of trading of reranker and
base parser. When using mixture reranking, it is not
clear how much relative weight the reranker and the
base parser get in the final decision of a parsing system
if the base parser’s score is not a probability (unless  ∈
{0, 1}). Indeed, the base parser score of Qi et al. [7] is not
normalized over all valid dependency trees. By using a
reranker reject option, we do not rely on the score of the
base parser: We tune a threshold of minimum certainty
that implicitly trades of reranker and base parser, but
always leads to a clear decision whether the reranker or
the base parser takes the final decision of the parsing
system.</p>
      <p>Treebank
Lithuanian HSE
Belarussian HSE
Marathi UFAL</p>
      <p>Tamil TTB
treebanks [13], since our chosen base parser’s
performance ofers most potential for improvement in the
lowresource domain [14]. In particular, we decided to use
Lithuanian, Belarussian, Marathi, and Tamil. The data
split is indicated in Table 1.</p>
      <p>We use the base parser of Qi et al. [7] and decode the
 -best list by using the algorithm of Zmigrod et al. [9].
As reranking models, we train structured support vector
machine (SVM) and Gaussian process (GP) classification
models on each of the four languages. All models are
kernelized with the kernel of Collins et al. [15], which
measures the similarity of two dependency trees in terms
of the number of subtrees that they have in common.
Thus, the kernel takes structural overlap as a measure of
similarity between the trees [15]. Except for the kernel,
no other features are used. We refrain from using
neuralnetwork-based rerankers to avoid the construction of
black-box features.</p>
      <p>We fix the random seeds to guarantee reproducibility
of our models. The implementation of the GPs (in
particular the computation of the GP posterior) does not allow
full reproducibility of the fitting procedure. To account
for this randomness, we train the GP models over five
diferent seeds.</p>
      <p>
        We tune the inverse regularization parameter  (only
in case of the SVM),  (the number of candidate trees
at prediction time) and  or  simultaneously on the
development set. In case of ties, we take the parameter
combination corresponding to maximum regularization
with the highest number of candidates. That corresponds
to making a conservative decision while maximizing the
diversity of the candidate parses, where the latter has
been suggested to be useful by Do and Rehbein [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. We
choose the hyperparameter combination that achieves
the highest (average) labeled attachment score (LAS) on
the development set.
      </p>
      <sec id="sec-4-1">
        <title>5.2. Evaluation</title>
        <p>The baseline performance is generated by feeding the
pretokenized sentences into the base parser and evaluating
5. Experimental Setup the predictions with the oficial CoNLL 2018 shared task
evaluation script. That leads to considerably higher LAS
5.1. Training scores than reported by Qi et al. [14]. We hypothesize
We ran our experiments on four low-resource lan- that this is the result of the fixed tokenization and the
guages using data from the Universal Dependencies v2.5 ifxed sentence split, given that the CoNLL 2018 shared</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>6. Results</title>
      <sec id="sec-5-1">
        <title>6.1. K-best list</title>
      </sec>
      <sec id="sec-5-2">
        <title>6.2. Reranking performance</title>
        <p>The performance in terms of the LAS of the diferent
parsing systems is indicated in Table 2. We can see that
augmenting the base parser by an interpretable reranker
does not improve parsing performance drastically.</p>
        <p>Looking at the chosen hyperparameters in Table 3, we
can see that both methods are highly sensitive to the
reranking model.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>7. Discussion</title>
      <p>Based on Figure 1, we generally see that even the strong
base parser of Qi et al. [7] can benefit from  -best
reranking. We hypothesize that the gains in oracle LAS of
additional candidates are crucial for reranking since those
were the highest for Lithuanian and Marathi (the only
languages where the base parser could be reranked on
average).</p>
      <p>
        Do and Rehbein [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] state that a prerequisite for
successful neural reranking is a high gold tree ratio. In
our opinion, this is not necessarily the case, given that
we were able to realize small improvements on a
language without a single gold tree in the  -best lists (i.e.
      </p>
      <p>Lithuanian) with non-neural models. In Belarussian, the
language with the highest gold tree ratio, on the other
hand, the reranker did not bring any benefit. The
reason why Belarussian is hard to rerank might be that the
base parser’s predictions are on average already close
to the true trees. This is indicated by a steeply rising
gold tree ratio but only marginal gains in oracle LAS
scores. Additional similar trees result in a high variance
of the reranker. As a result, the reranker uses strong
regularization. Still, the reranker performs worse than
the base parser, supposedly because the kernel function
is not ideal for capturing the fine-grained diferences
between these highly similar candidate trees [15]. In the
Acknowledgements
case of Tamil, we see a similar pattern. The higher
oracle LAS growth and the less steep increase in the gold
tree ratio however lead to less similar candidate trees. In We thank Ran Zmigrod for insightful discussions at an
this situation, a strongly regularized reranker can repro- early stage of the project.
duce the base parser performance, whereas biasing the
reranker towards the base parser seems to rather distort References
the predictions.</p>
      <p>Based on the performance in Table 2 and the
hyperparameter selection in Table 3, we see that both mixing
methods choose similar hyperparameters and perform
almost the same. Generally, the reranker reject option
works better if the classifier outputs well-calibrated
probabilities. Tuning the hyperparameter  might thus be
harder than tuning  , since it is dependent on the quality
of the reranker’s predictions.</p>
    </sec>
    <sec id="sec-7">
      <title>8. Conclusion</title>
      <p>
        All in all, we can observe from Table 2 that the gains of
interpretable rerankers (with the kernel of 15) are not
enough to justify their computational cost. This is in
line with the recent literature on  -best reranking [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>Our experiments with the  -best list show a positive
growth of the oracle LAS, meaning that  -best reranking
provides possibilities to improve on strong base parsers.</p>
      <p>However, the tested models cannot consistently leverage
this potential.</p>
      <p>By proposing a reranker reject option, we pave the way
for fully interpretable parsing systems. Similar in
performance to mixture reranking [10, 11], it allows a more
ifne-grained analysis of rerankers and base parsers, since
every parsing decision can be traced back to either of
the parsing system’s components. Future work towards
interpretable reranking could involve more sophisticated
kernel functions that are able to better measure the subtle
diferences between the candidate trees.
Proceedings of the 58th Annual Meeting N. Aepli, H. Aghaei, Universal dependencies 2.5,
of the Association for Computational Lin- 2019. URL: http://hdl.handle.net/11234/1-3105,
guistics, Association for Computational Lin- LINDAT/CLARIAH-CZ digital library at the
Instiguistics, Online, 2020, pp. 4123–4133. URL: tute of Formal and Applied Linguistics (ÚFAL),
Fachttps://aclanthology.org/2020.acl-main.379. ulty of Mathematics and Physics, Charles
Univerdoi:10.18653/v1/2020.acl-main.379. sity.
[6] K. Hall, K-best spanning tree parsing, in: Proceed- [14] P. Qi, Y. Zhang, Y. Zhang, J. Bolton, C. D.
Manings of the 45th Annual Meeting of the Associa- ning, Stanza: A Python natural language
processtion of Computational Linguistics, Association for ing toolkit for many human languages, in:
ProceedComputational Linguistics, Prague, Czech Republic, ings of the 58th Annual Meeting of the Association
2007, pp. 392–399. URL: https://aclanthology.org/ for Computational Linguistics: System
DemonstraP07-1050. tions, 2020. URL: https://aclanthology.org/2020.
[7] P. Qi, T. Dozat, Y. Zhang, C. D. Manning, Uni- acl-demos.14.pdf.</p>
      <p>versal Dependency parsing from scratch, in: Pro- [15] M. Collins, N. Dufy, F. Park, Parsing with a single
ceedings of the CoNLL 2018 Shared Task: Multilin- neuron: Convolution kernels for natural language
gual Parsing from Raw Text to Universal Depen- problems, Technical report UCSC-CRL-01-01,
dencies, Association for Computational Linguistics, University of California at Santa Cruz (2001).
Brussels, Belgium, 2018, pp. 160–170. URL: https: URL: http://luthuli.cs.uiuc.edu/~daf/courses/
//aclanthology.org/K18-2016. doi:10.18653/v1/ learning/Kernelpapers/collins01parsing.pdf.</p>
      <p>K18-2016. [16] D. Zeman, J. Hajič, M. Popel, M. Potthast, M. Straka,
[8] R. McDonald, F. Pereira, K. Ribarov, J. Hajič, Non- F. Ginter, J. Nivre, S. Petrov, CoNLL 2018 shared
projective dependency parsing using spanning tree task: Multilingual parsing from raw text to
Unialgorithms, in: Proceedings of Human Language versal Dependencies, in: Proceedings of the
Technology Conference and Conference on Empiri- CoNLL 2018 Shared Task: Multilingual Parsing
cal Methods in Natural Language Processing, Asso- from Raw Text to Universal Dependencies,
Associciation for Computational Linguistics, Vancouver, ation for Computational Linguistics, Brussels,
BelBritish Columbia, Canada, 2005, pp. 523–530. URL: gium, 2018, pp. 1–21. URL: https://aclanthology.
https://aclanthology.org/H05-1066. org/K18-2001. doi:10.18653/v1/K18-2001.
[9] R. Zmigrod, T. Vieira, R. Cotterell, On find- [17] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel,
ing the k-best non-projective dependency trees, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer,
in: Proceedings of the 59th Annual Meet- R. Weiss, V. Dubourg, J. Vanderplas, A. Passos,
ing of the Association for Computational Lin- D. Cournapeau, M. Brucher, M. Perrot, E.
Duchguistics and the 11th International Joint Con- esnay, Scikit-learn: Machine learning in Python,
ference on Natural Language Processing (Vol- Journal of Machine Learning Research 12 (2011)
ume 1: Long Papers), Association for Computa- 2825–2830. URL: https://www.jmlr.org/papers/
tional Linguistics, Online, 2021, pp. 1324–1337. volume12/pedregosa11a/pedregosa11a.pdf.
URL: https://aclanthology.org/2021.acl-long.106. [18] GPy, GPy: A gaussian process framework in python,
doi:10.18653/v1/2021.acl-long.106. 2012. URL: http://github.com/SheffieldML/GPy.
[10] K. Hayashi, S. Kondo, Y. Matsumoto,
Efifcient stacked dependency parsing by forest
reranking, Transactions of the Association
for Computational Linguistics 1 (2013) 139–150.</p>
      <p>URL: https://aclanthology.org/Q13-1012. doi:10.</p>
      <p>1162/tacl_a_00216.
[11] Q. Le, T. Mikolov, Distributed representations of
sentences and documents, in: E. P. Xing, T.
Jebara (Eds.), Proceedings of the 31st International
Conference on Machine Learning, volume 32 of
Proceedings of Machine Learning Research, PMLR,
Bejing, China, 2014, pp. 1188–1196. URL: http:
//proceedings.mlr.press/v32/le14.pdf.
[12] C. M. Bishop, Pattern Recognition and Machine</p>
      <p>Learning (Information Science and Statistics),</p>
      <p>Springer-Verlag, Berlin, Heidelberg, 2006.
[13] D. Zeman, J. Nivre, M. Abrams, E. Ackermann,</p>
      <sec id="sec-7-1">
        <title>A.1. Implementation Details</title>
        <p>We construct the training set for each of the chosen
languages by decoding up to 9 trees from the weighted graph
learned by the base parser with the algorithm of Zmigrod
et al. [9]. Next, we add the corresponding true parse tree
to the list. The  is viewed as a hyperparameter and is
thus not fixed a priori.</p>
        <p>We used the implementation of the SVMs from sklearn
[17] and the GP implementation of GPy [18]. We set
the hyperparameter  of the Collins et al. [15] kernel to
0.7 based on preliminary experiments on the Lithuanian
training set.</p>
        <p>The hyperparameters  ,  ,  and  are tuned
simultaneously on the development set. The search
space for the inverse regularization parameter c is
(0.0001, 0.0027, 0.7356, 199.5262).</p>
        <p>The search space for  is 0 &lt;  &lt; 16 for SVMs and
1 &lt;  &lt; 16 for GPs. The latter is motivated by empirical
reasons, i.e. the tendency of the GPs to always stick to
the base parser unless they are forced not to. With both
methods, the models can always fall back to the base
parser if that is considered optimal.</p>
        <p>For the search space for  , we generated 20 evenly
spaced values in the interval 0 ≤  ≤ 1 .</p>
        <p>Finally, the search space for  is (0, 0.3, 0.7, 1). Note
that none of the values is too close to 0.5, since we expect
high variance in the results if values near 0.5 are used as
a cutof.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kübler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>McDonald</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Nivre</surname>
          </string-name>
          , Dependency parsing,
          <source>Synthesis lectures on human language technologies 1</source>
          (
          <year>2009</year>
          )
          <fpage>1</fpage>
          -
          <lpage>127</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S. R.</given-names>
            <surname>Bowman</surname>
          </string-name>
          , J. Gauthier,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rastogi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Manning</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Potts</surname>
          </string-name>
          ,
          <article-title>A fast unified model for parsing and sentence understanding, in: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics</article-title>
          (Volume
          <volume>1</volume>
          :
          <string-name>
            <surname>Long</surname>
            <given-names>Papers)</given-names>
          </string-name>
          ,
          <source>Association for Computational Linguistics</source>
          , Berlin, Germany,
          <year>2016</year>
          , pp.
          <fpage>1466</fpage>
          -
          <lpage>1477</lpage>
          . URL: https://aclanthology.org/P16-1139. doi:
          <volume>10</volume>
          . 18653/v1/
          <fpage>P16</fpage>
          -1139.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>G.</given-names>
            <surname>Angeli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J. Johnson</given-names>
            <surname>Premkumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Manning</surname>
          </string-name>
          ,
          <article-title>Leveraging linguistic structure for open domain information extraction</article-title>
          ,
          <source>in: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing</source>
          (Volume
          <volume>1</volume>
          :
          <string-name>
            <surname>Long</surname>
            <given-names>Papers)</given-names>
          </string-name>
          ,
          <source>Association for Computational Linguistics</source>
          , Beijing, China,
          <year>2015</year>
          , pp.
          <fpage>344</fpage>
          -
          <lpage>354</lpage>
          . URL: https://aclanthology.org/P15-1034. doi:
          <volume>10</volume>
          . 3115/v1/
          <fpage>P15</fpage>
          -1034.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Straka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Straková</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hajič</surname>
          </string-name>
          ,
          <article-title>Evaluating contextualized embeddings on 54 languages in pos tagging, lemmatization and dependency parsing</article-title>
          , arXiv preprint arXiv:
          <year>1908</year>
          .
          <volume>07448</volume>
          (
          <year>2019</year>
          ). URL: https: //arxiv.org/pdf/
          <year>1908</year>
          .07448.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>B.-N.</given-names>
            <surname>Do</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Rehbein</surname>
          </string-name>
          ,
          <article-title>Neural reranking for dependency parsing: An evaluation</article-title>
          , in:
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>