    Retrieving Comparative Arguments using
Ensemble Methods and Neural Information Retrieval
                                            Notebook for the Touché Lab on Argument Retrieval at CLEF 2021

Viktoriia Chekalina1,2 , Alexander Panchenko1
1
    Skolkovo Institute of Science and Technology, Moscow, Russian Federation
2
    Philips Innovation Lab Rus, Moscow, Russian Federation


                                         Abstract
In this paper, we present a submission to the Touché lab’s Task 2 on Argument Retrieval for Comparative
Questions [1, 2]. Our team Katana supplies several approaches based on decision tree ensemble algorithms
to rank comparative documents in accordance with their relevance and argumentative support.
We use the PyTerrier [3] library to apply ensemble models to the ranking problem, considering statistical
text features and features based on comparative structures. We also employ large contextualized language
models, such as BERT [4], to solve the proposed ranking task. To merge this technique
with ranking modelling, we leverage the neural ranking library OpenNIR [5].
   Our systems substantially outperform the proposed baseline and scored first in relevance and
second in quality according to the official metric of the competition (NDCG@5). The presented
models could help improve the processing of comparative queries in information retrieval and
dialogue systems.

                                         Keywords
                                         comparative argument retrieval, natural language processing, neural information retrieval




1. Introduction
On a daily basis, people face the problem of choosing between two entities: which phone is
more reliable, which juice contains less sugar, which hotel is better for a holiday. Domain-
specific comparison systems, like WolframAlpha or Diffen, solve this problem only partly and rely
on structured data, which limits the number of cases in which they can be used.
   On the other hand, the Web contains a vast number of opinions and objective arguments that
can facilitate the comparative decision-making process. This creates the need for an
open-domain general system that can process such information. The challenge is to retrieve
relevant, supportive and credible arguments from a set of documents. The aim of the proposed work is
to retrieve documents from the ClueWeb12 1 corpus and re-rank them, considering argumentation
for or against one option or the other.
   The contribution of our work is the following: we are the first to apply ensemble methods based
on mixed statistical and comparative features to document ranking; we are the first to apply a
neural information retrieval approach to the task of argument retrieval; and we propose a model

CLEF 2021 – Conference and Labs of the Evaluation Forum, September 21–24, 2021, Bucharest, Romania
                                       © 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).




                  1
                      http://lemurproject.org/clueweb12
outperforming the baseline and yielding the first and second-best results according to the
relevance and quality metrics, respectively.


2. Related work
The work most relevant to ours is the previous shared task, Touché 2020 [6]. 17 participants
took part in the competition and submitted 41 runs. These participants tested various
approaches, including methods based on the extraction of structures corresponding to claims
and premises, assessing argument quality, representing documents with language models, and
expanding the query with similar words. The ranking function of the ChatNoir search engine [7],
based on the BM25F [8] approach, was used as a baseline.
   Only a few of the submitted solutions slightly improved over the baseline. The best overall
approach in the previous competition was a method based on query expansion and re-ranking
documents by relevance, credibility, and supportive quality [9].
   This work builds on our run submitted to the previous edition of the Touché shared task [10].
In that work, we used a pre-trained language model to estimate the relevance between the query
and a document. Extracting comparative structures and counting the number of comparative
sentences in a document helped us to assess the quality of relevant arguments.
   The problem of argument retrieval also arises in other scenarios. The comparative argumen-
tation machine CAM [11] retrieves comparative sentences with respect to given objects and
comparison aspects. The paper [12] explores the influence of context on an argument detection
system and demonstrates the resulting performance increase.


3. Datasets and experimental design
3.1. Datasets
The organizers provided 50 comparative questions (topics), for which we should obtain documents
containing convincing arguments for or against one or another option. The topics of the
competition are available online. 2
  In addition, 50 topics and the corresponding relevance annotations from the previous year’s com-
petition [13] were provided for supervised learning. These documents were also retrieved from
ChatNoir and manually labelled with scores of 0 (not relevant), 1 (relevant) or 2 (highly relevant). We
use this data to train and tune the models based on decision trees and to fine-tune the BERT
ranker. In addition, last year’s team submissions were available as well.
  Unfortunately, this data is insufficient for fitting large supervised ranking models, for
example, those based on BERT. In this case we use the adjacent question-answering dataset
called Antique [14]. This dataset consists of questions and answers from Yahoo! Webscope L6
and contains 2,626 open-domain non-factoid questions and 34,011 manual relevance annotations.
  Examples of a query with ranked answers are given in Table 6 and Table 7 in Appendix A. Note
that the Antique dataset uses a different set of relevance scores - 1, 2, 3, 4 instead of 0, 1, 2 - so
we remap Antique ranks in accordance with the following mapping: 1→0, 2→1, 3→1, 4→2.
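The label conversion itself is a simple lookup; the sketch below illustrates it (the variable names and the qrel tuple format are illustrative, not taken from our codebase).

# Remap ANTIQUE relevance labels (1-4) onto the Touché scale (0-2),
# following the mapping described above: 1→0, 2→1, 3→1, 4→2.
ANTIQUE_TO_TOUCHE = {1: 0, 2: 1, 3: 1, 4: 2}

def remap_qrels(qrels):
    """qrels: iterable of (query_id, doc_id, antique_label) tuples."""
    return [(qid, did, ANTIQUE_TO_TOUCHE[label]) for qid, did, label in qrels]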
    2
        https://webis.de/events/touche-21/shared-task-2.html
3.2. Evaluation setup
We use every topic as a query to the ChatNoir [7] search engine and extract up to 100 unique
documents from the ClueWeb12 corpus. We clean the documents’ bodies from HTML tags and
markup and rank them using one of the developed approaches described below.
   As auxiliary data, the organizers provided the topics of the previous year’s competition. For
each of those topics, a set of documents from ChatNoir was retrieved and labelled as described
above. We use this data to train the developed models and to validate the composed approaches. In the
validation phase, we split the labelled data into 40 topics for training and 10 for validation.
   In the run phase, we execute the produced solutions on the web evaluation platform TIRA [15]. At this
stage, we fit the models on the entire labelled data from the previous year and predict ranks for the
currently proposed topics. The runs were evaluated using NDCG metrics based on human
judgements of the submitted runs. Retrieved documents were judged according to two
criteria: (i) document relevance, and (ii) whether sufficient argumentative support is provided [16].
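   For reference, one common gain-based formulation of the metric (a standard definition, not specific to the Touché evaluation scripts) is

   \mathrm{DCG@}k = \sum_{i=1}^{k} \frac{2^{rel_i} - 1}{\log_2(i + 1)}, \qquad
   \mathrm{NDCG@}k = \frac{\mathrm{DCG@}k}{\mathrm{IDCG@}k},

where rel_i is the graded relevance of the document at rank i and IDCG@k is the DCG@k of the ideal ordering of the judged documents, so that NDCG@k lies in [0, 1].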


4. Document ranking using ensembles of trees
In this section, we use ensembles of trees as a supervised machine learning technique to solve
the ranking problem. We choose either pointwise regression tree algorithms, like Random Forest,
or boosted tree algorithms like XGBoost and LightGBM. In the case of the LightGBM model we
employ the LambdaMART [17] objective. It combines a cost function derived from minimizing the
number of inversions in a ranking (LambdaRank [18]) with an objective for building gradient boosted
decision trees (MART [19]). We use the PyTerrier platform for information retrieval.3 It simplifies
the extraction of text features and allows retrieval experiments to be expressed declaratively [20].
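   A minimal sketch of such a learning-to-rank pipeline in PyTerrier is shown below. The index path and the topic/qrel DataFrames are placeholders for illustration, not the exact code of our runs.

import pyterrier as pt
import lightgbm as lgb

if not pt.started():
    pt.init()

# Assumed: a pre-built Terrier index of the candidate documents
index = pt.IndexFactory.of("./touche_index")

# First stage: BM25 candidates augmented with additional weighting-model scores as features
candidates = pt.FeaturesBatchRetrieve(
    index, wmodel="BM25",
    features=["WMODEL:TF_IDF", "WMODEL:PL2", "WMODEL:DFIC"])

# Second stage: LambdaMART re-ranker (LightGBM with a ranking objective)
ranker = lgb.LGBMRanker(objective="lambdarank", num_leaves=15, learning_rate=0.1)
pipeline = candidates >> pt.ltr.apply_learned_model(ranker, form="ltr")

# train_topics/train_qrels: the 2020 topics and judgements (placeholder DataFrames
# in PyTerrier's topics/qrels format); valid_*: the held-out 10 validation topics
pipeline.fit(train_topics, train_qrels, valid_topics, valid_qrels)
run = pipeline.transform(test_topics)   # ranked results for the 2021 topics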

4.1. Feature extraction
For our ranking ML methods, we use features from three sources described below: (i)
ranking features extracted by PyTerrier, (ii) specific comparative features, and (iii) the score from the
ChatNoir system, which is based on a custom BM25 scoring function.4

4.1.1. Features extracted by PyTerrier
PyTerrier provides query-document matching scores computed by several weighting models. Among these
models are statistical measures (TF-IDF), measures based on language models (Hiemstra, Dirichlet), and
measures based on the occurrence of a term depending on the document fields it occurs
in (BM25F, PL2F). The list of all available models can be found on the Terrier site 5 . Among these varieties
we have chosen BM25, Hiemstra, DFIC, DPH, TF-IDF, DirichletLM and PL2 for our exploration.
   We applied each of the selected methods sequentially and independently to the training set,
ranked documents by the obtained scores and evaluated the rankings on the validation set. The
results of these tests are shown in Table 1. We have chosen the four methods with the most promising
results (PL2, TF-IDF, BM25 and DFIC), and their scores contribute four retrieval features.
    3
      https://pyterrier.readthedocs.io/en/latest/index.html
    4
      https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-similarity.html
    5
      http://terrier.org/docs/current/javadoc/org/terrier/matching/models/package-summary.html
Table 1
Results on the validation set for text features in PyTerrier models.
          Method       BM25      Hiemstra     DFIC      DPH        TF-IDF   DirichletLM   PL2
         NDCG@5        0.3637     0.3616      0.3642    0.3110     0.3637     0.3307      0.3703
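The per-model comparison in Table 1 can be reproduced with PyTerrier's Experiment helper; a sketch under the same assumptions as above (pre-built index, validation topics and qrels as placeholder DataFrames):

import pyterrier as pt

if not pt.started():
    pt.init()

index = pt.IndexFactory.of("./touche_index")   # assumed pre-built index
models = ["BM25", "Hiemstra_LM", "DFIC", "DPH", "TF_IDF", "DirichletLM", "PL2"]
retrievers = [pt.BatchRetrieve(index, wmodel=m) for m in models]

# Evaluate each weighting model independently on the validation topics
pt.Experiment(retrievers, valid_topics, valid_qrels,
              eval_metrics=["ndcg_cut_5"], names=models)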


4.1.2. Comparative features
We focus not only on finding highly relevant documents but also on finding documents with a
comparison of one object relative to another. The work [21] assumes that a comparative issue can
be represented by comparative structures: objects of comparison, comparison aspects and
predicates. We take the sequence-labelling model suggested in the cited paper and apply it
to the query. This helps us to identify the objects of comparison for every topic. Then we apply the
model to each document and obtain a set of comparative features.
   The feature is_retrieved indicates whether there are any comparative structures in the document
at all. The feature objs_score counts how many objects from the query are found in the document
(0, 1 or 2). The feature asp_pred_score is computed in the following way: if at least one object from the
query is in the document, every word in the document labelled as an aspect or predicate increases
the score by 0.5. Finally, we combine the defined features with the scores obtained from the ChatNoir
system, and the resulting feature vector for a query-document pair is {score_pl2, score_tf,
score_bm, score_dfic, baseline_scores, is_retrieved, ap_score, objs_score}.
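The feature computation can be summarized by the following sketch; the tag_comparative labeller and its tag set are hypothetical stand-ins for the sequence-labelling model of [21].

# Sketch of the comparative features; tag_comparative(text) is a hypothetical
# labeller returning (token, label) pairs with labels in
# {"OBJECT", "ASPECT", "PREDICATE", "O"}.
def comparative_features(query_objects, doc_text, tag_comparative):
    tags = tag_comparative(doc_text)
    # is_retrieved: does the document contain any comparative structure at all?
    is_retrieved = int(any(label != "O" for _, label in tags))
    # objs_score: how many of the query objects (0, 1 or 2) occur in the document
    doc_lower = doc_text.lower()
    objs_score = sum(obj.lower() in doc_lower for obj in query_objects)
    # asp_pred_score: if at least one query object is present, every aspect or
    # predicate token adds 0.5 to the score
    asp_pred_score = 0.0
    if objs_score > 0:
        asp_pred_score = 0.5 * sum(label in ("ASPECT", "PREDICATE") for _, label in tags)
    return {"is_retrieved": is_retrieved,
            "objs_score": objs_score,
            "asp_pred_score": asp_pred_score}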


4.2. Models
4.2.1. Random Forest
We use the Random Forest model imported from Sklearn and wrapped in a PyTerrier pipeline.
To find the best setup, we vary the number of estimators from 10 to 150; a value of 20 gives the
best validation NDCG@5 of 0.408.

4.2.2. XGBoost
We also wrap the XGBoost gradient boosting model in a PyTerrier class and tune hyperpa-
rameters by varying the learning rate from 1e−4 to 0.1 and max_depth from 4 to 16. The best
setup, a learning rate of 0.01 and max_depth of 6, gives an NDCG@5 of 0.547.
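Under the same assumptions as the earlier pipeline sketch (reusing its candidates feature retriever and placeholder topic/qrel DataFrames), an XGBoost ranker with an NDCG-based ranking objective can be plugged into the same PyTerrier pipeline; this is only an illustration with the best hyperparameters reported above.

import xgboost as xgb
import pyterrier as pt

# XGBoost ranker with an NDCG-based ranking objective and the best
# validation hyperparameters (learning rate 0.01, max_depth 6)
xgb_ranker = xgb.XGBRanker(objective="rank:ndcg", learning_rate=0.01, max_depth=6)
xgb_pipeline = candidates >> pt.ltr.apply_learned_model(xgb_ranker, form="ltr")
xgb_pipeline.fit(train_topics, train_qrels, valid_topics, valid_qrels)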

4.2.3. LightGBM
In the case of LightGBM, we vary the number of leaves from 8 to 20 and the learning rate from
0.001 to 0.1. The best configuration, with num_leaves equal to 15 and a learning rate of 0.1,
gives a score of 0.579.
   The feature importances of the resulting model are shown in Table 2. It can be seen that the most
significant feature is the score retrieved from ChatNoir, followed by the Divergence from
Independence based on Chi-square [22] and the presence of comparison objects in the document.
Table 2
Feature importance in the proposed LightGBM model
       Feature      PL2   TF-IDF    BM25     DFIC     ChatNoir   is_retrieved   objs_score   asp_pred_score
   Importance      1.76     1.19     1.51     2.3       20.8          0            1.66          1.51


5. Document ranking using neural information retrieval based
   on BERT
Contextualized language models such as BERT can be much more effective for ranking tasks
because they capture rich relationships between language units. In the proposed work we use a
re-ranking model from OpenNIR [5]6 based on the “Vanilla” Transformer architecture [23].

5.1. Text representation
BERT receives a query and document and processes it jointly. A distinctive feature of the BERT
reranker is injection token similarity matrices on each layer, which considerably improves
performance [24].
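The joint (query, document) scoring idea can be illustrated with a plain cross-encoder built on the HuggingFace transformers library. This is a simplified stand-in for the OpenNIR re-ranker, not the implementation used in our runs, and the classification head would still need to be trained on ranking data.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# A BERT cross-encoder: the query and document are concatenated and scored jointly.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=1)
model.eval()

def relevance_score(query: str, document: str) -> float:
    inputs = tokenizer(query, document, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        return model(**inputs).logits.squeeze().item()

# Documents are then re-ranked by descending score for each topic.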

5.2. Training process
First, we pretrain this reranker on the Antique dataset. We clean this dataset from incorrect
symbols and makeups. We also left from the dataset documents of length more than 300
characters, since the length of the ChatNoir retrieves usually does not exceed 300. The training
process lasted for 500 epochs with 0.001 learning rate and 56 objects in every batch. Finally, our
model gives NDCG@5 0.3362 on a validation set. We fine-tune the model on 40 train topics
from the previous year for 50 epochs with the same configuration. Fine-tuning increased the
score on validation up to 0.412.


6. Results
6.1. Results on validation set
The results of every proposed approach, obtained on the validation part of the data from the previous
year’s competition, are shown in Table 3. We also evaluate the previous year’s baseline on the validation
set. The best score comes from the LightGBM model, which also outperforms the baseline.
XGBoost scores somewhat lower, and Random Forest, as the simplest algorithm, has the lowest score
among the ensembles. The BERT ranker slightly outperforms Random Forest.
  In the right column, we also report the time required to train each model. It can be seen
that the ensemble-based models have approximately the same time complexity, while the BERT ranker
requires much more time to train.


   6
       https://github.com/Georgetown-IR-Lab/OpenNIR
Table 3
Results on validation set.
                              Method           NDCG@5       Time, ms
                              Random Forest       0.408      127.168
                              XGBoost             0.547      128.848
                              LightGBM            0.572      131.244
                              Bert Ranker         0.412      1560.947
                              Baseline’20         0.534         -


6.2. Results on test set
For final testing, the retrieved documents were labelled manually with scores from 0 to 3.
Judgement was carried out according to two independent criteria: the relevance of the document to the
given topic and the quality of the text. The quality criterion includes good language style, easy
reading, proper sentence structure, and the absence of typos and alliteration.
   For each criterion, a separate file with the assessors’ scores is available. The results of the two
evaluations are presented in Table 4 and Table 5. The runs of our team Katana achieve the best
result among all teams in terms of relevance and the second-best result in terms of text quality.
   As on the validation set, XGBoost and LightGBM give the best performance. This is well explained
by the fact that the losses of these models are based on ranking quality functions: NDCG in the XGBoost
case and LambdaMART in the LightGBM case. The former captures relevance slightly better
(0.489) and takes first place among all participants. For quality, conversely, LightGBM is
better: it achieves 0.684 and takes second place in the quality table, slightly behind the Top-1 approach.
The Random Forest method scores just below the baseline in both cases, which can be explained
by its more elementary algorithm for building an ensemble. The BERT ranker gives a fairly good result
for quality but a weak one for relevance. Perhaps the data from the adjacent question-answering task
used for training is the reason for the less accurate solution.

Table 4                                        Table 5
NDCG@5 scores on runs for relevance for Katana NDCG@5 scores on runs for quality for Katana
team, baseline and Top-2 approach              team, baseline and Top-1 approach
          Method              NDCG@5                      Method                NDCG@5
          Random Forest         0.393                     Random Forest            0.630
          XGBoost (Top 1)       0.489                     XGBoost                  0.675
          LightGBM              0.460                     LightGBM (Top 2)         0.684
          Bert Ranker           0.091                     Bert Ranker              0.466
          ChatNoir baseline     0.422                     ChatNoir baseline        0.636
          Thor team (Top 2)     0.478                     Rayla team (Top 1)       0.688
7. Conclusion
In this paper, we present our solution to the argument retrieval shared task. We pay particular attention
to ensemble methods and use statistical approaches, language modelling and comparative structure
extraction to derive features for them. We also use a neural re-ranker based on BERT
to exploit information from a contextualized model in our task.
   The best results were obtained by the gradient boosting methods trained with ranking cost
functions: XGBoost and LightGBM. The proposed approaches outperform the baseline and take first
and second place in the relevance and quality rankings, respectively. The BERT contextualized model
shows the need for larger training data.
Acknowledgments
This work is partially supported by the project “ACQuA: Answering Comparative Questions
with Arguments” (grants BI 1544/7-1 and HA 5851/2-1) as part of the priority program “RATIO:
Robust Argumentation Machines” (SPP 1999). We thank Maik Fröbe for providing support
for the software runs in the TIRA system.


Appendix A: Examples of training data

Table 6
Example of query and document with different relevance in Touche task dataset
 Query                  Document                                                                     Rank
 What is better         Disease and condition content is reviewed by our medical review board        2
 for the environment,   real or artificial? There is so much confusing information out there about
 a real or a fake       which is better for your health and the environment.
 Christmas tree?
                        You may think you’re saving a tree, but the plastic alternative has prob-    1
                        lems too. Which is “greener” an artificial Christmas tree or a real one?
                        This entry is part 25 of 103 in the series eco-friendly friday november      0
                        28th’s tip christmas trees: stuck between choosing a real Christmas tree
                        or a fake one?



Table 7
Example of query and document with different relevance in Antique dataset
 Query                  Document                                                                     Rank
 Why do we put the     They are saxon words. Knife would have been pronounced ker-niff.              4
 letter k on the words As a guess I would say that historically “kn” would have been pro-            3
 knife and knob, knee? nounced differently to “n” and that time has altered the way the words
                       are pronounced.
                       Because English is a funny language.                                          2
                       I don’t really (k)now!                                                        1




Appendix B: Examples of ranking results
In this appendix, we provide examples of the Top-3 ranked documents according to the LightGBM
and Baseline approaches.
Table 8
Example of documents with the different relevance to query “Is admission rate in Stanford higher than
that of MIT?”
                           Is admission rate in Stanford higher than that of MIT?
 LightGBM Top-3                                           Baseline Top-3
 1. Stanford and Harvard have a similar admissions        1. Stanford and Harvard have a similar admissions
 rate of about 7%. MIT comes with a somewhat              rate of about 7%. MIT comes with a somewhat
 greater rate of success admitting just under 10% or      greater rate of success admitting just under 10% or
 1742 for the class of 2015. Harvard, Stanford and        1742 for the class of 2015. Harvard, Stanford and
 MIT are global leaders in culture, commerce and          MIT are global leaders in culture, commerce and
 governmental policies.                                   governmental policies
 2. For more than a decade, i have served as an ad-       2. For more than a decade, i have served as an ad-
 missions officer for MIT. In that time, i’ve read more   missions officer for MIT. In that time, i’ve read more
 than 10,000 applications and have watched thou-          than 10,000 applications and have watched thou-
 sands of new students enter MIT. It is a privilege to    sands of new students enter MIT. It is a privilege to
 work at the most dynamic and exciting university         work at the most dynamic and exciting university
 in the world.                                            in the world.
 3. Our primary enhancement was targeted at fami-         3. All of this factual information, plus a lot of other
 lies earning less than $75,000 — making mit tuition      detail, can be found in the mit admissions literature.
 free and eliminating                                     In fact, this year, mit will award $74 million in un-
                                                          dergraduate aid.


Table 9
Example documents with the different relevance to query “Which smartphone has a better battery life:
Xperia or iPhone?”
                       Which smartphone has a better battery life: Xperia or iPhone?
 LightGBM Top-3                                           Baseline Top-3
 1. The power saver app that will turn down set-          1. The iPhone 4 is apple’s thinnest smartphone yet,
 tings when battery life is low to get as much juice      but offers a much better screen, faster processor,
 out of the battery as possible. Sony has set the         video calling, and many other enhancements.
 benchmark with its 12 megapixel camera inside the
 Xperia S.
 2. How to increase the battery life of apple’s iPhone    2. Sony Xperia’s review: an above average smart-
 4s many of those with an iphone 4s have complaints       phone gizmotraker’, as far as battery life is con-
 about the battery life. Apple has acknowledged           cerned, it last about 7 hr 30 min in talktime, 450 hrs
 these problems, and is working to fix them.              in standby.
 3. Sony Ericsson includes an 8gb card in the sales       3. How to increase the battery life of Apple’s Iphone
 package the Sony Ericsson Xperia arc s has below         4s many of those with an iphone 4s have complaints
 average battery life. Most users will get around 24      about the battery life. Apple has acknowledged
 hours of life out of the Xperia. X27’s 1600mah bat-      these problems, and is working to fix them.
 tery before it needs a recharge, but heavy users may
 need an injection of power before then.
References
[1] A. Bondarenko, L. Gienapp, M. Fröbe, M. Beloucif, Y. Ajjour, A. Panchenko,
    C. Biemann, B. Stein, H. Wachsmuth, M. Potthast, M. Hagen,                      Overview of
    Touché 2021: Argument Retrieval, in: D. Hiemstra, M.-F. Moens, J. Mothe, R. Perego,
    M. Potthast, F. Sebastiani (Eds.), Advances in Information Retrieval. 43rd European Con-
    ference on IR Research (ECIR 2021), volume 12036 of Lecture Notes in Computer Science,
    Springer, Berlin Heidelberg New York, 2021, pp. 574–582. URL: https://link.springer.com/
    chapter/10.1007/978-3-030-72240-1_67. doi:10.1007/978-3-030-72240-1\_67.
[2] A. Bondarenko, L. Gienapp, M. Fröbe, M. Beloucif, Y. Ajjour, A. Panchenko,
    C. Biemann, B. Stein, H. Wachsmuth, M. Potthast, M. Hagen,                      Overview of
    Touché 2021: Argument Retrieval, in: G. Faggioli, N. Ferro, A. Joly, M. Maistro, F. Piroi (Eds.),
    Working Notes Papers of the CLEF 2021 Evaluation Labs, CEUR Workshop Proceedings,
    2021.
[3] S. MacAvaney, C. Macdonald, N. Tonellotto, IR from Bag-of-words to BERT and Beyond
    through Practical Experiments: An ECIR 2021 tutorial with PyTerrier and OpenNIR, in:
    Proceedings of the 43rd European Conference on Information Retrieval Research, 2021, pp.
    728–730.
[4] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional
    transformers for language understanding, in: Proceedings of the 2019 Conference of
    the North American Chapter of the Association for Computational Linguistics: Human
    Language Technologies, Volume 1 (Long and Short Papers), Association for Computational
    Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. URL: https://www.aclweb.org/
    anthology/N19-1423. doi:10.18653/v1/N19-1423.
[5] S. MacAvaney, OpenNIR: A Complete Neural Ad-Hoc Ranking Pipeline, in: J. Caverlee, X. B.
    Hu, M. Lalmas, W. Wang (Eds.), WSDM ’20: The Thirteenth ACM International Conference
    on Web Search and Data Mining, Houston, TX, USA, February 3-7, 2020, ACM, 2020, pp. 845–
    848. URL: https://doi.org/10.1145/3336191.3371864. doi:10.1145/3336191.3371864.
[6] A. Bondarenko, M. Fröbe, M. Beloucif, L. Gienapp, Y. Ajjour, A. Panchenko, C. Biemann,
    B. Stein, H. Wachsmuth, M. Potthast, M. Hagen, Overview of Touché 2020: Argument
    Retrieval, 2020, pp. 384–395. doi:10.1007/978-3-030-58219-7_26.
[7] M. Potthast, M. Hagen, B. Stein, J. Graßegger, M. Michel, M. Tippmann, C. Welsch, Chat-
    Noir: A Search Engine for the ClueWeb09 Corpus, in: B. Hersh, J. Callan, Y. Maarek,
    M. Sanderson (Eds.), 35th International ACM Conference on Research and Development in
    Information Retrieval (SIGIR 2012), ACM, 2012, p. 1004. doi:10.1145/2348283.2348429.
[8] S. E. Robertson, H. Zaragoza, M. J. Taylor, Simple BM25 extension to multiple weighted
    fields, in: D. A. Grossman, L. Gravano, C. Zhai, O. Herzog, D. A. Evans (Eds.), Proceed-
    ings of the 2004 ACM CIKM International Conference on Information and Knowledge
    Management, Washington, DC, USA, November 8-13, 2004, ACM, 2004, pp. 42–49. URL:
    https://doi.org/10.1145/1031171.1031181. doi:10.1145/1031171.1031181.
[9] T. Abye, T. Sager, A. J. Triebel, An open-domain web search engine for answering com-
    parative questions, in: L. Cappellato, C. Eickhoff, N. Ferro, A. Névéol (Eds.), Working
    Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum, Thessaloniki, Greece,
    September 22-25, 2020, volume 2696 of CEUR Workshop Proceedings, CEUR-WS.org, 2020.
     URL: http://ceur-ws.org/Vol-2696/paper_130.pdf.
[10] V. Chekalina, A. Panchenko, Retrieving comparative arguments using deep pre-trained
     language models and NLU, in: L. Cappellato, C. Eickhoff, N. Ferro, A. Névéol (Eds.),
     Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum, Thessaloniki,
     Greece, September 22-25, 2020, volume 2696 of CEUR Workshop Proceedings, CEUR-WS.org,
     2020. URL: http://ceur-ws.org/Vol-2696/paper_210.pdf.
[11] M. Schildwächter, A. Bondarenko, J. Zenker, M. Hagen, C. Biemann, A. Panchenko, An-
     swering comparative questions: Better than ten-blue-links?, in: L. Azzopardi, M. Halvey,
     I. Ruthven, H. Joho, V. Murdock, P. Qvarfordt (Eds.), Proceedings of the 2019 Conference
     on Human Information Interaction and Retrieval, CHIIR 2019, Glasgow, Scotland, UK,
     March 10-14, 2019, ACM, 2019, pp. 361–365. URL: https://doi.org/10.1145/3295750.3298916.
     doi:10.1145/3295750.3298916.
[12] M. Fromm, E. Faerman, T. Seidl, TACAM: topic and context aware argument mining (2019)
     99–106. URL: https://doi.org/10.1145/3350546.3352506. doi:10.1145/3350546.3352506.
[13] A. Bondarenko, M. Fröbe, M. Beloucif, L. Gienapp, Y. Ajjour, A. Panchenko, C. Biemann,
     B. Stein, H. Wachsmuth, M. Potthast, M. Hagen, Overview of Touché 2020: Argument
     Retrieval, in: L. Cappellato, C. Eickhoff, N. Ferro, A. Névéol (Eds.), Working Notes Papers
     of the CLEF 2020 Evaluation Labs, volume 2696 of CEUR Workshop Proceedings, 2020. URL:
     http://ceur-ws.org/Vol-2696/.
[14] H. Hashemi, M. Aliannejadi, H. Zamani, W. B. Croft, ANTIQUE: A non-factoid question an-
     swering benchmark 12036 (2020) 166–173. URL: https://doi.org/10.1007/978-3-030-45442-5_
     21. doi:10.1007/978-3-030-45442-5\_21.
[15] M. Potthast, T. Gollub, M. Wiegmann, B. Stein, TIRA Integrated Research Architecture,
     in: N. Ferro, C. Peters (Eds.), Information Retrieval Evaluation in a Changing World, The
     Information Retrieval Series, Springer, Berlin Heidelberg New York, 2019. doi:10.1007/
     978-3-030-22948-1\_5.
[16] L. Braunstain, O. Kurland, D. Carmel, I. Szpektor, A. Shtok, Supporting human an-
     swers for advice-seeking questions in CQA sites, in: N. Ferro, F. Crestani, M. Moens,
     J. Mothe, F. Silvestri, G. M. D. Nunzio, C. Hauff, G. Silvello (Eds.), Advances in In-
     formation Retrieval - 38th European Conference on IR Research, ECIR 2016, Padua,
     Italy, March 20-23, 2016. Proceedings, volume 9626 of Lecture Notes in Computer Sci-
     ence, Springer, 2016, pp. 129–141. URL: https://doi.org/10.1007/978-3-319-30671-1_10.
     doi:10.1007/978-3-319-30671-1\_10.
[17] Q. Wu, C. J. Burges, K. M. Svore, J. Gao, Adapting boosting for information retrieval
     measures, Information Retrieval 13 (2010) 254–270. URL: https://www.microsoft.com/
     en-us/research/publication/adapting-boosting-for-information-retrieval-measures/.
[18] C. Burges, R. Ragno, Q. Le, Learning to Rank with Nonsmooth Cost Functions, in:
     B. Schölkopf, J. Platt, T. Hoffman (Eds.), Advances in Neural Information Processing
     Systems, volume 19, MIT Press, 2007. URL: https://proceedings.neurips.cc/paper/2006/file/
     af44c4c56f385c43f2529f9b1b018f6a-Paper.pdf.
[19] J. Friedman, Stochastic Gradient Boosting, Computational Statistics & Data Analysis 38
     (2002) 367–378. doi:10.1016/S0167-9473(01)00065-2.
[20] C. Macdonald, N. Tonellotto, Declarative Experimentation in Information Retrieval using
     PyTerrier, in: K. Balog, V. Setty, C. Lioma, Y. Liu, M. Zhang, K. Berberich (Eds.), ICTIR ’20:
     The 2020 ACM SIGIR International Conference on the Theory of Information Retrieval,
     Virtual Event, Norway, September 14-17, 2020, ACM, 2020, pp. 161–168. URL: https:
     //dl.acm.org/doi/10.1145/3409256.3409829.
[21] V. Chekalina, A. Bondarenko, C. Biemann, M. Beloucif, V. Logacheva, A. Panchenko, Which
     is better for deep learning: Python or MATLAB? Answering Comparative Questions in
     Natural Language, in: Proceedings of the 16th Conference of the European Chapter
     of the Association for Computational Linguistics: System Demonstrations, Association
     for Computational Linguistics, Online, 2021, pp. 302–311. URL: https://www.aclweb.org/
     anthology/2021.eacl-demos.36.
[22] I. Kocabas, B. Dincer, B. Karaoğlan, A Nonparametric Term Weighting Method for Infor-
     mation Retrieval Based on Measuring the Divergence from Independence, Information
     Retrieval (2013) 1–24. doi:10.1007/s10791-013-9225-4.
[23] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. u. Kaiser,
     I. Polosukhin, Attention is all you need, in: I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach,
     R. Fergus, S. Vishwanathan, R. Garnett (Eds.), Advances in Neural Information Processing
     Systems, volume 30, Curran Associates, Inc., 2017. URL: https://proceedings.neurips.cc/
     paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.
[24] S. MacAvaney, A. Yates, A. Cohan, N. Goharian, CEDR: Contextualized Embeddings
     for Document Ranking (2019) 1101–1104. URL: https://doi.org/10.1145/3331184.3331317.
     doi:10.1145/3331184.3331317.