
Exploring Argument Retrieval for Controversial Questions Using Retrieve and Re-rank Pipelines

Raunak Agarwal

Andrei Koniaev

Robin Schaefer

Department of Linguistics, University of Potsdam, 14476 Potsdam, Germany

This notebook documents Team Macbeth's contribution to the CLEF 2021 shared task Touché: Argument Retrieval for Controversial Questions. Our approach consists of different configurations of a two-step retrieve and re-rank pipeline. We experimented with sparse and dense approaches for argument retrieval and trained query-document cross-encoders for argument re-ranking. Our findings suggest that a sparse retriever combined with a custom re-ranker performed best out of all our approaches.

Keywords: Argument Retrieval, Sentence Embeddings, Semantic Search

1. Introduction

2. Related Work

While standard information retrieval (IR) systems have focused largely on sparse bag-of-words-based approaches such as BM25, recent work in IR shows the effectiveness of a two-step retrieve and re-rank pipeline, in which a sizeable number of candidate documents are first retrieved using these sparse representations and then re-ranked using (trainable) neural models [3].

Attempts are also being made to get rid of sparse representations altogether through the use of dense retrieval systems [4]. A standard dense retrieval architecture comprises a transformer-based encoder, which is fine-tuned on a given training corpus of queries and relevant documents. The encoded documents are usually added to an index that supports approximate nearest neighbour search. There is also work showing that combining sparse and dense representations can further enhance the performance of these IR systems [5].

Our submissions for the Touché shared task center around the above methods.

3. Experiments

3.1. Experimental Setup

All our experiments were run on a machine with an Intel Xeon E5-2650 CPU (24 cores, 256 GB RAM) and two NVIDIA GTX 1080 Ti GPUs (24 GB VRAM). We also used Weights and Biases³ to track our experiments.

3.2. Pre-training

We continued pre-training RoBERTa-base on the entire args.me corpus using the Masked Language Modeling (MLM) task first introduced by BERT [6] and later modified by Liu et al. in RoBERTa [7]. RoBERTa demonstrated an improvement over BERT's performance with a small adaptation to the pre-training task, hence we chose to follow their approach.

Our motivation for pre-training was to make sure that our model first learns from the domain-invariant representations present in RoBERTa-base, and then enhances these representations through (continued) pre-training on our custom domain. This kind of domain-adaptive pre-training has been shown to offer gains in task performance [8].

We used the hyper-parameters of the RoBERTa-base model and trained it for 10 epochs⁴, generating a domain-specific RoBERTa-base model with perplexity ≈ 4.1.
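For reference, perplexity is the exponential of the mean cross-entropy (negative log-likelihood) over the predicted tokens. The following is a minimal sketch, assuming losses are measured in nats; the function name and the sample values are illustrative and are not taken from the training logs.

```python
import math


def perplexity(token_losses):
    """Perplexity = exp(mean negative log-likelihood), with losses in nats."""
    if not token_losses:
        raise ValueError("need at least one token loss")
    return math.exp(sum(token_losses) / len(token_losses))


# Illustrative values only: a mean loss of ln(4.1), roughly 1.411 nats,
# corresponds to a perplexity of about 4.1.
example_losses = [math.log(4.1)] * 8
print(round(perplexity(example_losses), 2))
```

Read this way, a perplexity of about 4.1 means the model is, on average, as uncertain as if it were choosing uniformly among roughly four tokens at each masked position.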

3.3. Re-annotation

The organisers of Touché 2021 provide the participants with 2298 relevance judgements to allow training and evaluation of their systems. These relevance judgements are the result of crowd-sourcing efforts by Mechanical Turk⁵ workers, a practice which has been criticised for its questionable data quality [9], leaving aside major ethical considerations concerning labour exploitation [10].

Our initial plan was to use these annotations to train a sentence-pair classifier. After a closer look, however, we found that these annotations were riddled with errors and therefore not suitable as a training set.

Instead of eliminating their use altogether, we decided to re-annotate all of the 2298 relevance judgements.⁶ We went through two rounds of annotation for each query-document pair and obtained the following inter-annotator agreement: Krippendorff's alpha = 0.39.

³ https://wandb.ai/
⁴ https://wandb.ai/ragabet/’roberta-base’
⁵ https://www.mturk.com/
⁶ The annotations are available on our git repository.

Due to time constraints, the runs we submitted were trained only on the first round of annotations. The relatively low inter-annotator agreement suggests that our runs would have turned out slightly different had we trained our models on an average of the two annotation rounds.
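To make the agreement figure above more concrete, the sketch below computes Krippendorff's alpha for two complete annotation rounds. It assumes nominal labels and no missing values; the text does not state which distance metric was used, so the nominal variant is an assumption, and the toy labels are not the actual annotations.

```python
from collections import Counter


def krippendorff_alpha_nominal(round_a, round_b):
    """Nominal Krippendorff's alpha for two complete annotation rounds."""
    if len(round_a) != len(round_b) or not round_a:
        raise ValueError("rounds must be non-empty and of equal length")
    # Coincidence counts: each unit contributes both ordered label pairs.
    coincidences = Counter()
    for a, b in zip(round_a, round_b):
        coincidences[(a, b)] += 1
        coincidences[(b, a)] += 1
    # Marginal totals per label.
    totals = Counter()
    for (label, _), count in coincidences.items():
        totals[label] += count
    n = sum(totals.values())  # number of pairable labels (two per unit)
    observed = sum(v for (c, k), v in coincidences.items() if c != k) / n
    expected = sum(
        totals[c] * totals[k] for c in totals for k in totals if c != k
    ) / (n * (n - 1))
    if expected == 0:
        raise ValueError("alpha is undefined when only one label occurs")
    return 1.0 - observed / expected


# Toy labels (hypothetical): two rounds over ten query-document pairs.
first_round = [0, 1, 0, 0, 0, 0, 0, 0, 1, 0]
second_round = [1, 1, 1, 0, 0, 1, 0, 0, 0, 0]
print(round(krippendorff_alpha_nominal(first_round, second_round), 3))
```

Alpha is 1 for perfect agreement and 0 when the observed disagreement equals what chance would produce, so 0.39 sits well below the levels usually considered reliable.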

3.4. Sentence Embeddings

When it was first introduced, BERT set new state-of-the-art results on various NLP tasks, including question answering, sentence classification, and sentence-pair regression. A big disadvantage of BERT's network structure, however, was its inability to compute independent sentence embeddings for single input sequences.

To overcome the above issue, we used UKP Lab's Sentence-BERT (or SBERT) [12], which is a modification of the standard BERT architecture. SBERT adds a mean pooling operation on top of the contextualized word vectors generated by BERT/RoBERTa. This enables the generation of semantically meaningful sentence/document embeddings which can be used for downstream tasks. We made use of the regression objective function described in their paper. A pair-wise regressor was trained using the cosine similarity between the two embeddings u and v (where u is the query embedding and v is the document embedding). The objective function was optimized using mean-squared-error loss.
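The regression objective described above can be sketched in plain Python. This is an illustration only: the token vectors are made-up stand-ins for encoder outputs, u denotes the query embedding and v the document embedding, and the gold label of 1.0 for a relevant pair is an assumed scaling.

```python
import math


def mean_pool(token_vectors):
    """Average token vectors into one fixed-size embedding."""
    dim = len(token_vectors[0])
    return [
        sum(vec[i] for vec in token_vectors) / len(token_vectors)
        for i in range(dim)
    ]


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mse_loss(predictions, targets):
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


# Hypothetical token vectors standing in for BERT/RoBERTa outputs.
u = mean_pool([[1.0, 0.0], [1.0, 2.0]])  # query embedding
v = mean_pool([[2.0, 0.0], [0.0, 0.0]])  # document embedding
score = cosine_similarity(u, v)
loss = mse_loss([score], [1.0])  # assumed gold label: 1.0 = relevant
print(round(score, 4), round(loss, 4))
```

During training, the gradient of this loss flows back through the pooling step into the shared encoder, which pulls the embeddings of relevant query-document pairs closer together.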

The terminology used in SBERT is further refined by Humeau et al. [13], who define two approaches for pair-wise sentence scoring: bi-encoders and cross-encoders (see Figure 1).

3.4.1. Bi-Encoder

The architecture introduced in SBERT is what is now known as a bi-encoder. Using a bi-encoder, each sentence can be encoded into an independent sentence embedding. The creation of these vector representations enables efficient document retrieval through the use of standard similarity measures (such as Euclidean distance or cosine similarity) in the embedding space.

After the pre-training step (3.2), we trained a bi-encoder using the query-document annotations described in 3.3. This bi-encoder was used to generate document embeddings for the entire corpus, giving us an embedding space of size n × d, where d is the embedding size and n is the total number of documents. This embedding space was then indexed by a dense retriever as described in 3.5.2.

Note: Each document in the corpus consists of premises and a conclusion. To generate document embeddings, we ignore the conclusion and use only the premises.

3.4.2. Cross-Encoder

A cross-encoder is analogous to the standard BERT design where full-attention is applied across tokens over an input sentence pair. While a bi-encoder takes two inputs and returns two representations (or embeddings), cross-encoders take two inputs and return a single decision directly. They outperform bi-encoders on pair-wise sentence scoring tasks at the cost of speed.

Since cross-encoders are slow and do not produce independent embeddings, they are unsuitable for first-stage retrieval over a full corpus. We therefore used them in the second step of our pipeline to re-rank documents: a cross-encoder was trained (after the MLM pre-training in 3.2) on the annotations described in 3.3. As a baseline, we also made use of a cross-encoder pre-trained on the MS MARCO dataset [14].

3.5. Retrieval Models

3.5.1. Sparse: BM25 (Elasticsearch)

BM25 is a traditional bag-of-words-based retrieval function which scores the relevancy of documents for a given query using the frequencies of common terms between the query and document. As a variation of the TF-IDF function, it is sensitive to the token frequencies as well as their inverse document frequencies.

Due to its simplicity, computational efficiency, and performance, BM25 is a critical component of large-scale search applications and the de facto industrial standard in IR tasks. To index our id-document pairs, we used the implementation available in Elasticsearch⁷ with the default settings enabled.
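As an illustration of the scoring function, here is a minimal BM25 sketch in plain Python. It is not the Elasticsearch implementation: it assumes pre-tokenized documents, a Lucene-style smoothed IDF, the textbook form with the (k1 + 1) factor, and the parameter values k1 = 1.2 and b = 0.75. The toy documents are hypothetical.

```python
import math
from collections import Counter


def bm25_scores(query_terms, tokenized_docs, k1=1.2, b=0.75):
    """Score every document against the query with a textbook BM25."""
    n_docs = len(tokenized_docs)
    avg_len = sum(len(doc) for doc in tokenized_docs) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    scores = []
    for doc in tokenized_docs:
        term_freq = Counter(doc)
        # Longer-than-average documents are penalized via this factor.
        length_norm = 1 - b + b * len(doc) / avg_len
        score = 0.0
        for term in query_terms:
            tf = term_freq[term]
            if tf == 0:
                continue
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


# Hypothetical toy corpus, already lower-cased and tokenized.
docs = [
    ["school", "uniforms", "should", "be", "mandatory"],
    ["nuclear", "energy", "is", "safe"],
    ["school", "lunch", "should", "be", "free"],
]
scores = bm25_scores(["school", "uniforms"], docs)
print([round(s, 3) for s in scores])
```

The first document matches both query terms and scores highest, the third matches only the more common term, and the second shares no term with the query and scores zero, which is the lexical gap that motivates the dense retriever.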

3.5.2. Dense: Approximate Nearest Neighbours (hnswlib)

Despite its robustness, BM25 has several shortcomings. It suffers from the lexical gap problem [15], a common occurrence in systems built on sparse representations; empirical results have also shown that it overly penalizes very long documents [16].

To overcome the above problems, we deployed the sparse BM25 retriever alongside a dense retriever. Experimental results on open-domain question answering suggest that dense representations from BERT-based encoders can be more effective than BM25 for retrieval [4].

Constructing a dense retriever was a two-step process: first, we encoded the entire corpus into a dense vector space using the bi-encoder described in 3.4.1. Second, the representations created by the bi-encoder were indexed using a library that implements approximate nearest neighbours search (hnswlib).⁸

⁷ https://lucene.apache.org/core/7_0_1/core/org/apache/lucene/search/similarities/BM25Similarity.html

Approximate nearest neighbour search is an important step in efficiently finding the document vectors most similar to a given query vector. The alternative is brute force, i.e. computing the cosine similarity of the query vector with every single document vector. We chose hnswlib since systems based on hierarchical navigable small world graphs (HNSW) [17] represent the current state of the art in approximate nearest neighbour search.⁹
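The brute-force alternative mentioned above can be sketched as follows. This is the exact search that an HNSW index approximates, not the hnswlib API; the function names and the two-dimensional vectors are hypothetical.

```python
import heapq
import math


def normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def brute_force_top_k(query_vec, doc_vecs, k):
    """Exact search: compare the query with every document vector."""
    q = normalize(query_vec)
    scored = (
        (sum(a * b for a, b in zip(q, normalize(d))), doc_id)
        for doc_id, d in doc_vecs.items()
    )
    # Keep only the k highest cosine similarities.
    return [doc_id for _, doc_id in heapq.nlargest(k, scored)]


# Hypothetical 2-d embeddings keyed by document id.
doc_vecs = {"d1": [1.0, 0.0], "d2": [0.7, 0.7], "d3": [0.0, 1.0]}
top = brute_force_top_k([1.0, 0.2], doc_vecs, k=2)
print(top)
```

Each query here costs one similarity computation per document, so the running time grows linearly with the corpus size. A graph-based index trades a small loss in recall for a much smaller number of comparisons per query.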

3.6. Data Augmentation

For our data augmentation approach, we followed the methodology described in the Augmented SBERT paper [18], where a pre-trained cross-encoder is used to weakly label a sample of unlabeled query-document pairs. The query-document pairs were sampled using BM25 and fed into a cross-encoder trained on MS MARCO to generate silver labels, which were then appended to the gold training set to train an augmented bi-encoder (see Figure 2).
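The augmentation step can be sketched as below. The weak labeler is a stand-in: a simple token-overlap function replaces the MS MARCO cross-encoder so that the example runs on its own. Skipping pairs that already carry a gold label is an assumption, and all names and data are hypothetical.

```python
def augment_training_set(gold_pairs, candidate_pairs, weak_labeler):
    """Append silver-labelled pairs to the gold set, skipping duplicates."""
    seen = {(query, doc) for query, doc, _ in gold_pairs}
    augmented = list(gold_pairs)
    for query, doc in candidate_pairs:
        if (query, doc) in seen:
            continue  # assumption: keep the human label where one exists
        seen.add((query, doc))
        augmented.append((query, doc, weak_labeler(query, doc)))
    return augmented


def toy_labeler(query, doc):
    """Stand-in for a cross-encoder: fraction of query tokens in the doc."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q)


gold = [("should schools require uniforms", "uniforms reduce bullying", 1.0)]
sampled = [  # in the pipeline these pairs would come from BM25 sampling
    ("should schools require uniforms", "uniforms reduce bullying"),
    ("should schools require uniforms", "schools should fund libraries"),
]
training_set = augment_training_set(gold, sampled, toy_labeler)
print(len(training_set), training_set[-1][2])
```

The resulting list mixes gold and silver labels in one training set, which is then used to train the bi-encoder in the usual way.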

3.7. Retrieve and Re-rank

The two-step pipeline of retrieve and re-rank has been known to work well on IR tasks. Given a search query, the first step is to retrieve a large list of candidate documents which are potentially relevant for the query. For the retrieval stage, we experimented with a sparse retriever (3.5.1), a dense retriever (3.5.2), and a combination of both (by simply appending the outputs of the two retrievers).

In the second step, we used a re-ranker based on a cross-encoder (3.4.2) that scores the relevancy of all the retrieved candidates (Figure 3). We experimented with a custom cross-encoder trained on our annotations and a pre-trained cross-encoder trained on the MS MARCO dataset. For each query, 100 candidate documents were retrieved and sent to the cross-encoder for the purposes of re-ranking. After re-ranking, only the top 50 documents were included in the final submission file.

⁸ https://github.com/nmslib/hnswlib
⁹ http://ann-benchmarks.com/

Table 1
Run  Retriever       Augmenter  Reranker                  Relevance  Quality
1    Sparse          No         Pretrained Cross Encoder
2    Sparse          No         Custom Cross Encoder
3    Sparse + Dense  No         Custom Cross Encoder
4    Dense           No         Custom Cross Encoder
5    Sparse + Dense  Yes        Custom Cross Encoder
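The two steps described in this section can be sketched as follows. The retrievers and the scoring function are toy stand-ins for BM25, the HNSW index and the cross-encoder. Requesting the candidate budget from each retriever separately and dropping repeated document ids when appending the outputs are assumptions, since the text does not specify either detail.

```python
def retrieve_and_rerank(query, retrievers, score_pair,
                        n_candidates=100, n_final=50):
    """Collect candidates from each retriever, then re-rank them."""
    candidates = []
    seen = set()
    for retrieve in retrievers:
        for doc_id, text in retrieve(query, n_candidates):
            if doc_id not in seen:  # assumption: drop repeated ids
                seen.add(doc_id)
                candidates.append((doc_id, text))
    # Score every (query, document) pair and sort by descending score.
    ranked = sorted(
        candidates, key=lambda c: score_pair(query, c[1]), reverse=True
    )
    return [doc_id for doc_id, _ in ranked[:n_final]]


# Hypothetical toy corpus and stand-in components.
corpus = {
    "a": "uniforms reduce bullying",
    "b": "nuclear energy is safe",
    "c": "uniforms limit expression",
}


def toy_sparse(query, k):
    terms = set(query.split())
    return [(i, t) for i, t in corpus.items() if terms & set(t.split())][:k]


def toy_dense(query, k):
    return list(corpus.items())[:k]


def toy_score(query, text):
    return len(set(query.split()) & set(text.split()))


result = retrieve_and_rerank(
    "uniforms reduce bullying", [toy_sparse, toy_dense], toy_score, n_final=2
)
print(result)
```

Passing a single retriever in the list gives the sparse-only or dense-only configuration, and passing both gives the combined one.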

4. Evaluation

We performed evaluation using the relevance judgements¹⁰ and quality judgements¹¹ provided by the organisers of the shared task. The metric used was nDCG@5. The results are in Table 1.
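For readers unfamiliar with the metric, the sketch below computes nDCG@5 for a single query. It assumes linear gain, a log2 rank discount and non-negative relevance labels; the official evaluation tooling may differ in such details, and the example labels are made up.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k ranks (linear gain)."""
    return sum(
        rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k])
    )


def ndcg_at_k(ranked_relevances, all_judged_relevances, k=5):
    """nDCG@k; the ideal ranking is built from all judged documents."""
    ideal = dcg_at_k(sorted(all_judged_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0


# Hypothetical labels: relevance of the top-5 returned documents, and the
# labels of every judged document for the same query.
returned = [2, 0, 1, 0, 0]
judged = [2, 1, 1, 0, 0, 0]
value = ndcg_at_k(returned, judged, k=5)
print(round(value, 4))
```

The per-query values are averaged over all test queries to obtain the reported score.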

5. Conclusion

In this paper, we outlined Team Macbeth’s contribution to the CLEF lab Touché. Our central approach consisted of using tried-and-tested methods for information retrieval and re-ranking. We continued pre-training RoBERTa-base on the args.me corpus with a masked language modeling task, re-annotated the relevance judgements from Touché 2020, and attempted neural methods for both retrieval and re-ranking. The combination of a sparse retriever and a custom neural re-ranker stands out as the best method in terms of both argument relevance and argument quality.

¹⁰ https://webis.de/events/touche-21/touche-task1-51-100-relevance.qrels
¹¹ https://webis.de/events/touche-21/touche-task1-51-100-quality.qrels

[1] Bondarenko, Gienapp, Fröbe, Beloucif, Ajjour, Panchenko, Biemann, Stein, Wachsmuth, Potthast, Hagen, Overview of Touché 2021: Argument retrieval, 2021.

[2] Ajjour, Wachsmuth, Kiesel, Potthast, Hagen, Stein, Data acquisition for argument search: The args.me corpus, in: KI, 2019.

[3] Nogueira, Cho, Passage re-ranking with BERT, 2020. arXiv:1901.04085.

[4] Karpukhin, Oğuz, Min, Lewis, L. Y. Wu, Edunov, Chen, W.-t. Yih, Dense passage retrieval for open-domain question answering, in: EMNLP, 2020.

[5] Luan, Eisenstein, Toutanova, M. Collins, Sparse, dense, and attentional representations for text retrieval, 2021. arXiv:2005.00181.

[6] Devlin, M.-W. Chang, Lee, Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, 2019. arXiv:1810.04805.

[7] Liu, Ott, Goyal, Du, Joshi, Chen, Levy, Lewis, Zettlemoyer, Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, 2019. arXiv:1907.11692.

[8] Gururangan, Marasović, Swayamdipta, Lo, Beltagy, Downey, N. A. Smith, Don't stop pretraining: Adapt language models to domains and tasks, 2020. arXiv:2004.10964.

[9] Hauser, G. Paolacci, J. J. Chandler, Common concerns with MTurk as a participant pool: Evidence and solutions, 2018. URL: psyarxiv.com/uq45c. doi:10.31234/osf.io/uq45c.

[10] Schlagwein, Cecez-Kecmanovic, Hanckel, Ethical norms and issues in crowdsourcing practices: A Habermasian analysis, Information Systems Journal (2018). doi:10.1111/isj.12227.

[11] Reimers, I. Gurevych, Sentence-transformers documentation, https://www.sbert.net/, 2019. (Accessed on 05/28/2021).

[12] Reimers, I. Gurevych, Sentence-BERT: Sentence embeddings using siamese BERT-networks, 2019. arXiv:1908.10084.

[13] Humeau, Shuster, M.-A. Lachaux, J. Weston, Poly-encoders: Architectures and pre-training strategies for fast and accurate multi-sentence scoring, in: ICLR, 2020.

[14] Bajaj, Campos, Craswell, Deng, Gao, Liu, Majumder, McNamara, Mitra, Nguyen, Rosenberg, Song, Stoica, Tiwary, Wang, MS MARCO: A human generated machine reading comprehension dataset, 2018. arXiv:1611.09268.

[15] Berger, Caruana, D. A. Cohn, Freitag, Mittal, Bridging the lexical chasm: Statistical approaches to answer-finding, in: SIGIR '00, 2000.

[16] Lv, Zhai, When documents are very long, BM25 fails!, Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval (2011).

[17] Y. A. Malkov, D. A. Yashunin, Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs, 2018. arXiv:1603.09320.

[18] Thakur, Reimers, Daxenberger, I. Gurevych, Augmented SBERT: Data augmentation method for improving bi-encoders for pairwise sentence scoring tasks, 2021. arXiv:2010.08240.