Exploring Argument Retrieval for Controversial Questions Using Retrieve and Re-rank Pipelines
Notebook for the Touché Lab on Argument Retrieval at CLEF 2021

Raunak Agarwal1, Andrei Koniaev1 and Robin Schaefer1
1 Department of Linguistics, University of Potsdam, 14476 Potsdam, Germany


Abstract
This notebook documents Team Macbeth's contribution to the CLEF 2021 shared task Touché: Argument Retrieval for Controversial Questions. Our approach consists of different configurations of a two-step retrieve and re-rank pipeline. We experimented with sparse and dense approaches for argument retrieval and trained query-document cross-encoders for argument re-ranking. Our findings suggest that a sparse retriever combined with a custom re-ranker performed the best out of all our approaches.

Keywords
Argument Retrieval, Sentence Embeddings, Semantic Search




1. Introduction
In this notebook, we present our approaches for Touché 2021 Task 1: Argument Retrieval for
Controversial Questions [1]. The task entails retrieving arguments from a focused document
collection crawled from debate portals [2]. The aim is to assist users in finding "strong"
arguments that support or oppose their position on a given controversial topic.
   We discuss recent work on information retrieval (IR) methods (Section 2), then present a
high-level overview of our experiments (Section 3) and our results (Section 4). To facilitate
reproducibility, we also make our source code1 and trained models2 available.


2. Related Work
While standard information retrieval systems have largely relied on sparse bag-of-words
approaches such as BM25, recent trends in IR point to the strong performance of two-step
retrieve and re-rank pipelines, in which a sizeable number of candidate documents is first
retrieved using the aforementioned sparse representations and then re-ranked using (trainable)
neural models [3].

CLEF 2021 – Conference and Labs of the Evaluation Forum, September 21–24, 2021, Bucharest, Romania
ragarwal@uni-potsdam.de (R. Agarwal); koniaev@uni-potsdam.de (A. Koniaev); robin.schaefer@uni-potsdam.de (R. Schaefer)
                                       © 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073




                  1
                    https://github.com/raunak-agarwal/touche-2021-task-1
                  2
                    https://huggingface.co/ragarwal
   Attempts are also being made to do away with sparse representations altogether through the use
of dense retrieval systems [4]. A standard dense retrieval architecture comprises a transformer-
based encoder, which is fine-tuned on a training corpus of queries and relevant documents.
The encoded documents are usually added to an index that supports approximate nearest
neighbour search. There is also work showing that combining sparse and dense representations
can further enhance the performance of these IR systems [5].
   Our submissions for the Touché shared task center around the above methods.


3. Experiments
3.1. Experimental Setup
All our experiments were run on a setup comprising an Intel Xeon E5-2650 CPU (24
cores, 256 GB RAM) and 2 NVIDIA GTX 1080 Ti GPUs (24 GB VRAM). We also used Weights
and Biases3 to track our experiments.

3.2. Pre-training
We continued pre-training on the entire args.me corpus using the Masked Language Modeling
(MLM) task first introduced by BERT [6] and later modified by Liu et al. for RoBERTa [7]. RoBERTa
improved on BERT's performance with a small adaptation of the pre-training task, hence we
chose to follow their approach.
   Our motivation for pre-training was to ensure that our model first learns from the domain-
invariant representations present in RoBERTa-base, and then enhances these representations
through (continued) pre-training on our custom domain. This kind of domain-adaptive pre-
training has been shown to offer gains in task performance [8].
   We used the hyper-parameters of the RoBERTa-base model and trained for 10 epochs4,
yielding a domain-specific RoBERTa-base model with a perplexity of ≈ 4.1.
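   The continued pre-training step can be sketched roughly as follows with the Hugging Face
Transformers library; the corpus file name and the hyper-parameters shown (batch size,
masking probability) are illustrative assumptions rather than our exact configuration.

```python
from datasets import load_dataset
from transformers import (RobertaTokenizerFast, RobertaForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

# Assumption: the args.me texts have been exported to a plain-text file,
# one document per line (the file name is illustrative).
corpus = load_dataset("text", data_files={"train": "argsme_corpus.txt"})

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
model = RobertaForMaskedLM.from_pretrained("roberta-base")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = corpus.map(tokenize, batched=True, remove_columns=["text"])

# Dynamic masking as in RoBERTa (15% masking probability assumed here).
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True,
                                           mlm_probability=0.15)

args = TrainingArguments(output_dir="roberta-base-argsme",
                         num_train_epochs=10,
                         per_device_train_batch_size=8)

trainer = Trainer(model=model, args=args,
                  train_dataset=tokenized["train"], data_collator=collator)
trainer.train()
```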

3.3. Re-annotation
The organisers of Touché 2021 provided the participants with 2298 relevance judgements to allow
training/evaluation of their systems. These relevance judgements are the result of crowd-sourcing
efforts by Mechanical Turk5 workers, a practice which has been criticised for its questionable
data quality [9], leaving aside major ethical considerations concerning labour exploitation [10].
   Our initial plan was to use these annotations to train a sentence-pair classifier. After a closer
look, however, we found that the annotations were riddled with errors and therefore not
suitable as a training set.
   Instead of discarding them altogether, we decided to re-annotate all 2298 relevance
judgements.6 We went through two rounds of annotation for each query-document pair and
achieved an inter-annotator agreement of Krippendorff's alpha = 0.39.
   3
     https://wandb.ai/
   4
     https://wandb.ai/ragabet/’roberta-base’
   5
     https://www.mturk.com/
   6
     The annotations are available on our git repository.
Figure 1: Pair-wise scoring architectures (Figure taken from the SBERT website [11])


   Due to time constraints, the runs we submitted were trained only on the first round of
annotations. The relatively low inter-annotator agreement suggests that our runs would have
turned out slightly different had we trained our models on an average of the two annotation rounds.
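   For reference, the agreement score can be computed from the two annotation rounds with
the krippendorff package; the array below is a toy illustration, not our actual labels.

```python
import numpy as np
import krippendorff

# Rows = annotation rounds, columns = query-document pairs.
# Relevance labels are treated as ordinal; np.nan marks missing judgements.
# The values below are toy data for illustration only.
ratings = np.array([
    [2, 0, 1, 2, np.nan, 1],
    [2, 1, 1, 0, 2,      1],
])

alpha = krippendorff.alpha(reliability_data=ratings, level_of_measurement="ordinal")
print(f"Krippendorff's alpha: {alpha:.2f}")
```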

3.4. Sentence Embeddings
When it was first introduced, BERT set new state-of-the-art results on various NLP tasks,
including question answering, sentence classification, and sentence-pair regression. A big
disadvantage of the BERT’s network structure, however, was its inability to generate sentence
embeddings based on single-input sequences.
   To overcome the above issue, we used UKP Lab’s Sentence-BERT (or SBERT) [12] which is a
modification of the standard BERT architecture. SBERT adds a mean pooling operation on top
of the contextualized word vectors generated by BERT/RoBERTa. This enables the generation
of semantically meaningful sentence/document embeddings which can be used for downstream
tasks. We made use of the regression objective function described in their paper. A pair-wise
regressor was trained using cosine-similarity between the two embeddings 𝑢 and 𝑣 (where 𝑢 is
the query embedding and 𝑣 is the document embedding). The objective function was optimized
using mean-squared-error loss.
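   The following sketch illustrates this training setup with the sentence-transformers library;
the checkpoint path, example pairs and hyper-parameters are placeholders rather than our exact
configuration.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, models

# Word embedding model + pooling, as in SBERT (Pooling defaults to mean pooling).
# The checkpoint path for our domain-adapted RoBERTa model is illustrative.
word_embedding = models.Transformer("roberta-base-argsme", max_seq_length=512)
pooling = models.Pooling(word_embedding.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_embedding, pooling])

# Toy training pairs: (query, premises) with a relevance score scaled to [0, 1].
train_examples = [
    InputExample(texts=["Should zoos exist?",
                        "Zoos help preserve endangered species."], label=0.9),
    InputExample(texts=["Should zoos exist?",
                        "Cats are popular pets."], label=0.1),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# Regression objective: cosine similarity of the two embeddings, optimized with MSE loss.
train_loss = losses.CosineSimilarityLoss(model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=4, warmup_steps=100)
```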
   The terminology used in SBERT was further refined by Humeau et al. [13], who define two
approaches for pair-wise sentence scoring: bi-encoders and cross-encoders (see Figure 1).

3.4.1. Bi-Encoder
The architecture introduced in SBERT is what is now known as a bi-encoder. With a bi-encoder,
each sentence can be encoded into an independent sentence embedding. These vector
representations enable efficient document retrieval through the use of standard similarity
measures (such as Euclidean distance or cosine similarity) in the embedding space.
   After the pre-training step (3.2), we trained a bi-encoder using the query-document anno-
tations described in 3.3. This bi-encoder was used to generate document embeddings for the
entire corpus, giving us an embedding matrix of size 𝑚 × 𝑛, where 𝑚 is the embedding size and 𝑛
is the total number of documents. This embedding matrix was then indexed by a dense retriever
as described in 3.5.2.
   Note: Each document in the corpus consists of premises and a conclusion. To generate
document embeddings, we ignore the conclusion and use only the premises.
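   Conceptually, the corpus encoding step amounts to the following sketch; the document
structure and variable names are simplified assumptions about the args.me format.

```python
# docs: list of corpus documents, each assumed to carry a list of premises and a conclusion.
# We concatenate only the premise texts and ignore the conclusion.
premise_texts = [" ".join(p["text"] for p in doc["premises"]) for doc in docs]

# bi_encoder is the SentenceTransformer model trained above.
# Result: an n x m matrix (n documents, m embedding dimensions).
doc_embeddings = bi_encoder.encode(premise_texts, batch_size=64,
                                   convert_to_numpy=True, show_progress_bar=True)
```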

3.4.2. Cross-Encoder
A cross-encoder is analogous to the standard BERT design where full-attention is applied across
tokens over an input sentence pair. While a bi-encoder takes two inputs and returns two
representations (or embeddings), cross-encoders take two inputs and return a single decision
directly. They outperform bi-encoders on pair-wise sentence scoring tasks at the cost of speed.
   Since cross-encoders are slow and do not produce independent embeddings, they cannot be
used for retrieval tasks. We used them in the second step of our pipeline to re-rank documents
where a cross-encoder was trained (after MLM pretraining 3.2) on the annotations as described
in 3.3. As a baseline, we also made use of a cross-encoder pretrained on the MSMARCO dataset.
[14]
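   A hedged sketch of both cross-encoder variants with sentence-transformers follows; the
checkpoint paths, the MS MARCO model name, the toy data and the hyper-parameters are
assumptions, not our exact setup.

```python
from torch.utils.data import DataLoader
from sentence_transformers import CrossEncoder, InputExample

# Toy stand-ins for our re-annotated pairs and for retrieved candidates.
annotated_pairs = [("Should zoos exist?", "Zoos help preserve endangered species.", 0.8)]
query = "Should zoos exist?"
candidate_docs = ["Zoos help preserve endangered species.", "Cats are popular pets."]

# Custom cross-encoder: start from the domain-adapted checkpoint (path illustrative)
# and train a single-output head on query-document pairs with labels in [0, 1].
custom_ce = CrossEncoder("roberta-base-argsme", num_labels=1, max_length=512)
train_examples = [InputExample(texts=[q, d], label=score)
                  for q, d, score in annotated_pairs]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
custom_ce.fit(train_dataloader=train_dataloader, epochs=4, warmup_steps=100)

# Baseline: a publicly available cross-encoder pre-trained on MS MARCO.
msmarco_ce = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

# Scoring: both models take (query, document) pairs and return one relevance score each.
scores = custom_ce.predict([(query, doc) for doc in candidate_docs])
```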

3.5. Retrieval Models
3.5.1. Sparse: BM25 (Elasticsearch)
BM25 is a traditional bag-of-words retrieval function which scores the relevance of
documents for a given query using the frequencies of terms shared between the query and the
document. As a variant of the TF-IDF scheme, it is sensitive to term frequencies as well
as inverse document frequencies.
   Due to its simplicity, computational efficiency, and performance, BM25 is a critical
component of large-scale search applications and serves as the de facto industry standard in IR
tasks. To index our id-document pairs, we used the implementation available in Elasticsearch7
with the default settings.
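   A minimal indexing and querying sketch with the Python Elasticsearch client is shown below;
the index name, field names and toy documents are our own illustrations, and the exact call
signatures vary between client versions (7.x-style calls shown).

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Elasticsearch scores "match" queries with BM25 by default, so no extra
# similarity configuration is needed.
es.indices.create(index="argsme", ignore=400)  # ignore "index already exists" errors

# Toy stand-in for the id -> premise-text mapping built from args.me.
corpus = {"d1": "Zoos help preserve endangered species.",
          "d2": "Keeping animals in captivity is cruel."}

for doc_id, premises in corpus.items():
    es.index(index="argsme", id=doc_id, body={"text": premises})

# Retrieve the top 100 candidates for a query.
response = es.search(index="argsme",
                     body={"query": {"match": {"text": "Should zoos exist?"}},
                           "size": 100})
candidates = [(hit["_id"], hit["_score"]) for hit in response["hits"]["hits"]]
```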

3.5.2. Dense: Approximate Nearest Neighbours (hnswlib)
Despite its robustness, BM25 has several shortcomings. It suffers from the lexical gap problem
[15], a common issue in systems built on sparse representations; empirical results have
also shown that it overly penalizes very long documents [16].
   To mitigate these problems, we deployed a dense retriever alongside the BM25-based sparse
retriever. Experimental results have demonstrated that contextual text representations from BERT
can be more effective than BM25 on retrieval tasks [4].
   Constructing the dense retriever was a two-step process: first, we encoded the entire corpus
into a dense vector space using the bi-encoder described in 3.4.1. Second, the representations
created by the bi-encoder were indexed using hnswlib,8 a library that implements approximate
nearest neighbour search.

   7
       https://lucene.apache.org/core/7_0_1/core/org/apache/lucene/search/similarities/BM25Similarity.html
Figure 2: Pipeline for weakly-supervised learning (Figure taken from the Augmented SBERT paper
[18])


   Approximate nearest neighbour search allows us to efficiently find the most similar document
vectors for a given query vector; the alternative, brute force, computes the cosine similarity
between the query vector and every single document vector. We chose hnswlib because
systems based on hierarchical navigable small world graphs (HNSW) [17] represent the current
state-of-the-art in approximate nearest neighbour search.9
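   The indexing and querying steps can be sketched as follows; the index hyper-parameters
(M, ef_construction, ef) and the random embeddings are illustrative, not our tuned values.

```python
import hnswlib
import numpy as np

# doc_embeddings: the n x m matrix produced by the bi-encoder (toy data here).
doc_embeddings = np.random.rand(1000, 768).astype(np.float32)
n_docs, emb_dim = doc_embeddings.shape

# Build an HNSW index over cosine distance.
index = hnswlib.Index(space="cosine", dim=emb_dim)
index.init_index(max_elements=n_docs, ef_construction=200, M=48)
index.add_items(doc_embeddings, ids=np.arange(n_docs))
index.set_ef(128)  # trade-off between recall and query speed

# Approximate nearest neighbours for a query embedding.
query_embedding = np.random.rand(768).astype(np.float32)
labels, distances = index.knn_query(query_embedding, k=100)
```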

3.6. Data Augmentation
For our data augmentation approach, we followed the methodology described in the Augmented
SBERT paper [18], where a pre-trained cross-encoder is used to weakly label a sample of
unlabeled query-document pairs. Query-document pairs were sampled using BM25 and fed
into a cross-encoder trained on MSMARCO to generate silver labels, which were then appended
to the gold training set to train an augmented bi-encoder (Figure 2).
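   In simplified form, the augmentation step looks roughly as follows; the sampling strategy is
condensed, and queries, bm25_retrieve, gold_examples and the MS MARCO checkpoint name
are assumed placeholders.

```python
from sentence_transformers import CrossEncoder, InputExample

# Cross-encoder used as the weak labeller (checkpoint name is an assumption).
labeller = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

silver_examples = []
for query in queries:                              # unlabeled topic queries (assumed)
    # Sample candidate documents for the query with BM25 (see 3.5.1).
    candidates = bm25_retrieve(query, k=50)        # assumed helper returning premise texts
    scores = labeller.predict([(query, doc) for doc in candidates])
    # Scores may need rescaling (e.g. a sigmoid) to match the gold label range.
    silver_examples += [InputExample(texts=[query, doc], label=float(score))
                        for doc, score in zip(candidates, scores)]

# Gold + silver examples are then used to train the augmented bi-encoder (3.4.1).
train_examples = gold_examples + silver_examples
```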

3.7. Retrieve and Re-rank
The two-step pipeline of retrieve and re-rank has been known to work well on IR tasks. Given a
search query, the first step is to retrieve a large list of candidate documents which are potentially
relevant for the query. For the retrieval stage, we experimented with a sparse retriever (3.5.1), a
dense retriever (3.5.2), and a combination of both (by simply appending the outputs of the two
retrievers).
   In the second step, we used a re-ranker based on a cross-encoder (3.4.2) to score the
relevance of all retrieved candidates (Figure 3). We experimented with a custom cross-encoder
trained on our annotations and with a pre-trained cross-encoder trained on the MSMARCO
dataset. For each query, 100 candidate documents were retrieved and passed to the cross-encoder
for re-ranking. After re-ranking, only the top 50 documents were included in the final
submission file.
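   Putting the pieces together, one run of the pipeline amounts to roughly the following sketch;
bm25_retrieve and cross_encoder stand for the components described above, and the function
and parameter names are illustrative.

```python
def retrieve_and_rerank(query, bm25_retrieve, cross_encoder, k_retrieve=100, k_final=50):
    """Retrieve k_retrieve candidates, re-rank them, keep the top k_final."""
    # Step 1: candidate retrieval (sparse, dense, or the concatenation of both).
    candidates = bm25_retrieve(query, k=k_retrieve)     # list of (doc_id, premise_text)

    # Step 2: the cross-encoder scores every (query, document) pair.
    scores = cross_encoder.predict([(query, text) for _, text in candidates])

    # Sort by the re-ranker score and keep the top documents for the run file.
    reranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [(doc_id, float(score)) for (doc_id, _), score in reranked[:k_final]]
```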
    8
        https://github.com/nmslib/hnswlib
    9
        http://ann-benchmarks.com/
Figure 3: Overview of the retrieve and re-rank pipeline: candidate documents are retrieved using a
sparse/dense retriever and then re-ranked using a pretrained/custom cross-encoder


Table 1
Summary of evaluation results (measure: nDCG@5)

Run   Retriever        Augmenter   Re-ranker                   Relevance   Quality
1     Sparse           No          Pretrained Cross-Encoder    0.456       0.525
2     Sparse           No          Custom Cross-Encoder        0.608       0.803
3     Sparse + Dense   No          Custom Cross-Encoder        0.607       0.783
4     Dense            No          Custom Cross-Encoder        0.507       0.750
5     Sparse + Dense   Yes         Custom Cross-Encoder        0.554       0.752




4. Evaluation
We evaluated our runs using the relevance judgements10 and quality judgements11 provided
by the organisers of the shared task. The metric used was nDCG@5. The results are shown in Table 1.
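   The scores can be reproduced from a run and the official qrels with pytrec_eval, for example
as sketched below; the run dictionary is a toy placeholder for our actual system output.

```python
import pytrec_eval

# Load the official relevance judgements (TREC qrels format).
with open("touche-task1-51-100-relevance.qrels") as f:
    qrels = pytrec_eval.parse_qrel(f)

# run: {query_id: {doc_id: score}}; the entries here are placeholders.
run = {"51": {"arg-id-1": 12.3, "arg-id-2": 10.1}}

evaluator = pytrec_eval.RelevanceEvaluator(qrels, {"ndcg_cut"})
per_query = evaluator.evaluate(run)
ndcg_5 = sum(m["ndcg_cut_5"] for m in per_query.values()) / len(per_query)
print(f"nDCG@5: {ndcg_5:.3f}")
```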


5. Conclusion
In this paper, we outlined Team Macbeth’s contribution to the CLEF lab Touché. Our central
approach consisted of using tried-and-tested methods for information retrieval and re-ranking.
We pre-trained the args.me corpus on a masked language modeling task, re-annotated the
relevance arguments from Touché 2020, and attempted neural methods for both retrieval and
re-ranking. The combination of a sparse retriever and a custom neural re-ranker stands out as
the best method in terms of both argument relevance as well as argument quality.




   10
        https://webis.de/events/touche-21/touche-task1-51-100-relevance.qrels
   11
        https://webis.de/events/touche-21/touche-task1-51-100-quality.qrels
References
 [1] A. Bondarenko, L. Gienapp, M. Fröbe, M. Beloucif, Y. Ajjour, A. Panchenko, C. Biemann,
     B. Stein, H. Wachsmuth, M. Potthast, M. Hagen, Overview of touché 2021: Argument
     retrieval, 2021.
 [2] Y. Ajjour, H. Wachsmuth, J. Kiesel, M. Potthast, M. Hagen, B. Stein, Data acquisition for
     argument search: The args.me corpus, in: KI, 2019.
 [3] R. Nogueira, K. Cho, Passage re-ranking with bert, 2020. arXiv:1901.04085.
 [4] V. Karpukhin, B. Oğuz, S. Min, P. Lewis, L. Y. Wu, S. Edunov, D. Chen, W. tau Yih, Dense
     passage retrieval for open-domain question answering, in: EMNLP, 2020.
 [5] Y. Luan, J. Eisenstein, K. Toutanova, M. Collins, Sparse, dense, and attentional representa-
     tions for text retrieval, 2021. arXiv:2005.00181.
 [6] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional
     transformers for language understanding, 2019. arXiv:1810.04805.
 [7] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoy-
     anov, Roberta: A robustly optimized bert pretraining approach, 2019. arXiv:1907.11692.
 [8] S. Gururangan, A. Marasović, S. Swayamdipta, K. Lo, I. Beltagy, D. Downey, N. A.
     Smith, Don’t stop pretraining: Adapt language models to domains and tasks, 2020.
     arXiv:2004.10964.
 [9] D. Hauser, G. Paolacci, J. J. Chandler, Common concerns with mturk as a participant pool:
     Evidence and solutions, 2018. URL: psyarxiv.com/uq45c. doi:10.31234/osf.io/uq45c.
[10] D. Schlagwein, D. Cecez-Kecmanovic, B. Hanckel, Ethical norms and issues in crowd-
     sourcing practices: A habermasian analysis, Information Systems Journal (2018).
     doi:10.1111/isj.12227.
[11] N. Reimers, I. Gurevych, Sentence-transformers documentation, https://www.sbert.net/,
     2019. (Accessed on 05/28/2021).
[12] N. Reimers, I. Gurevych, Sentence-bert: Sentence embeddings using siamese bert-networks,
     2019. arXiv:1908.10084.
[13] S. Humeau, K. Shuster, M.-A. Lachaux, J. Weston, Poly-encoders: Architectures and
     pre-training strategies for fast and accurate multi-sentence scoring, in: ICLR, 2020.
[14] P. Bajaj, D. Campos, N. Craswell, L. Deng, J. Gao, X. Liu, R. Majumder, A. McNamara,
     B. Mitra, T. Nguyen, M. Rosenberg, X. Song, A. Stoica, S. Tiwary, T. Wang, Ms marco: A
     human generated machine reading comprehension dataset, 2018. arXiv:1611.09268.
[15] A. Berger, R. Caruana, D. A. Cohn, D. Freitag, V. Mittal, Bridging the lexical chasm:
     statistical approaches to answer-finding, in: SIGIR ’00, 2000.
[16] Y. Lv, C. Zhai, When documents are very long, bm25 fails!, Proceedings of the 34th inter-
     national ACM SIGIR conference on Research and development in Information Retrieval
     (2011).
[17] Y. A. Malkov, D. A. Yashunin, Efficient and robust approximate nearest neighbor search
     using hierarchical navigable small world graphs, 2018. arXiv:1603.09320.
[18] N. Thakur, N. Reimers, J. Daxenberger, I. Gurevych, Augmented sbert: Data augmen-
     tation method for improving bi-encoders for pairwise sentence scoring tasks, 2021.
     arXiv:2010.08240.