LeviRANK: Limited Query Expansion with Voting Integration for Document Retrieval and Ranking
Notebook for the Touché Lab on Argument Retrieval at CLEF 2022
Ashish Rana1,*, Pujit Golchha1,*, Roni Juntunen1,2,*, Andreea Coajă1,*, Ahmed Elzamarany1,*, Chia-Chien Hung1 and Simone Paolo Ponzetto1
1 Data and Web Science Group, University of Mannheim, Germany
2 Lappeenranta-Lahti University of Technology LUT, Finland

CLEF'22: Conference and Labs of the Evaluation Forum, September 5–8, 2022, Bologna, Italy
asrana@mail.uni-mannheim.de (A. Rana); pgolchha@mail.uni-mannheim.de (P. Golchha); roni.juntunen@student.lut.fi (R. Juntunen); acoaja@mail.uni-mannheim.de (A. Coajă); aelzamar@mail.uni-mannheim.de (A. Elzamarany); chia-chien.hung@uni-mannheim.de (C. Hung); ponzetto@uni-mannheim.de (S. P. Ponzetto)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org, ISSN 1613-0073).

Abstract
The information available on the internet for making informed decisions in personal life is often overwhelming, which makes comparative decision-making particularly challenging. Given the plethora of online resources, finding relevant and correct responses often wastes considerable time and effort. The Touché 2022 Shared Task 2 on Comparative Questions addresses this problem by retrieving corresponding documents for a given comparative question, where the importance of the retrieved documents is determined both by their relevance and their quality. In this paper, we present LeviRANK, a three-stage retrieval, ranking, and stance prediction system. It uses bidirectional self-attention-based language models to detect argumentation in documents. In the first stage, it incorporates a novel, empirically designed retrieval approach that produces the highest recall values for short comparative queries. The retrieval module uses voting-based BM25 retrieval to merge multiple BM25 result lists from a pool of relevant expanded queries. We then use monoT5 and duoT5 document rankers based on the "Expando-Mono-Duo" design pattern. Finally, we identify object stance with a two-step prediction approach that first separates out documents specifically related to the objects and then identifies the relevant object within them. With the proposed approach, we observe that bidirectional self-attention-based document ranking models identify argumentation structure more successfully than probabilistic ranking models. The LeviRANK system achieves the highest mean nDCG@5 score of 0.758 for the document relevance task, the second-highest nDCG@5 score of 0.744 for the document quality task, and the second-highest Macro-F1 score of 0.301 for the stance prediction task.

Keywords
Comparative Question Answering, Document Retrieval, Document Ranking, Multi-Stage Document Ranking

1. Introduction
In the current consumer-driven economic landscape, with the overabundance of products and their associated information on the web, it is very hard to make informed decisions. Studies have demonstrated that people prefer online research for essential life decisions, and comparative questions are also part of those decisions [1, 2]. These comparative questions can be factual (e.g., which footballer has the most goals?) or contextual (e.g., who is the best footballer?), where the retrieved relevant argumentative contents are often distributed over several sentences or passages [3]. The Argument Retrieval for Comparative Questions shared task is specifically designed to handle these comparative topics, retrieve corresponding relevant documents, and further provide object entailment information. The distribution of context and the argumentative nature of the relevant information for a given question make the document ranking and stance prediction problems especially challenging [4]. The problem becomes even harder when questions are purely abstract (e.g., what is the best way to live life?), in which case the answer varies based on personal and societal preferences. With our contribution to Touché 2022 Shared Task 2: Argument Retrieval for Comparative Questions, we intend to explore these questions and detect argumentative structures in relevant documents with a self-attention mechanism [5].

Figure 1: Architectural illustration of the LeviRANK system pipeline, which includes three steps: initial document retrieval, relevance ranking, and stance prediction modules. The two alternative pipelines with different ranking models, namely (1) monoT5 only (single-stage) and (2) monoT5-duoT5 (multi-stage), are highlighted in the diagram as well.

For designing our initial document retrieval we use the passage corpus already expanded with queries generated by DocT5Query, as provided on the TIRA forum [6, 7, 8]. Based on our analysis of the recall metric, we observe that query drift is quite prevalent when using ad-hoc versions of different supplementary retrieval approaches (e.g., pseudo-relevance feedback, query expansion), leading to lower recall values. Hence, in this work we propose Limited Query Expansion with Voting Integration for Document Retrieval and RANKing (LeviRANK), a novel framework for argument retrieval for comparative questions. The LeviRANK system, depicted in Figure 1, includes three steps. (1) Initial Retrieval: an empirically designed query expansion variant that starts with an initial BM25 retrieval from a limited pool of expanded queries [9, 10, 11]. This fixed pool of expanded queries is prepared by limiting the different query expansion methods to either replace (e.g., adjectives with their synonym/antonym pairs), add (e.g., pseudo-relevance queries), or remove (e.g., noun-only queries) only one term, in order to restrict query drift. We use the original query's BM25 retrieval as the driving relevance set of the top-1000 relevant documents, and further utilize voting amongst the pool of queries to select the most relevant documents from this 1000-document set and prepare a more concise document relevance set. After that, for each remaining query in the query pool, we append top-retrieved disjoint document sets to this concise document relevance set in a cascading manner. The corresponding set size for the top-retrieved documents depends on the relevance of each query, which we manually tuned for this task. (2) Document Ranking: we utilize the "Expando-Mono-Duo" design pattern for two-stage pointwise and pairwise document ranking with T5 language models [6, 12]. (3) Stance Prediction: we use a two-step classification approach with RoBERTa language models [13] to handle the unbalanced multi-class stance prediction problem.
In the first step, we segregate documents containing object-relevant information, and in the second step, we identify the specific object's relevance in that document given a topic query.1

The contribution of our team Captain Levi in this paper is an empirically grounded retrieval approach with limited query expansion for comparative queries. We show that this retrieval approach is more representative of the relevant document space for a given topic query. Additionally, we investigate the representational capabilities of the bi-directional self-attention-based monoT5 and duoT5 document ranking models. Finally, we quantify the stance prediction capabilities of the two-step multi-class classification approach, both in zero-shot and fine-tuned settings, for object entailment tasks.

1 All resources developed as part of this work are publicly available at: https://github.com/softgitron/LeviRank.

2. Related Work
Argument mining, document retrieval, and ranking tasks have been studied extensively with successful deep learning approaches in recent years [4, 14, 15, 16, 17]. Previously, the Argument Retrieval for Comparative Questions task focused on retrieving relevant argumentative passages from generic web crawl document collections [18, 19]. In related experimental studies, the approaches have primarily utilized ChatNoir by inputting either the original or expanded preprocessed queries for initial document retrieval [20]. After the initial retrieval, the documents have been ranked using multiple machine learning and deep learning approaches such as random forest, XGBoost, LightGBM, Word2vec, GPT-2, fine-tuned BERT, and DistilBERT [21, 22, 23, 24, 25, 26, 27, 28, 29]. In our approach, we additionally tackle the initial retrieval problem by using the text passages expanded with queries generated by DocT5Query instead of using ChatNoir [7, 6].

Multi-stage ranking retrieval systems have arguably been among the most practical solutions for modern search systems in both industry and academia [30, 31, 32]. Multi-stage retrieval and document ranking pipelines with dynamic embedding representations obtained from bi-directional language model architectures have achieved strong results in the past [4, 14, 15]. The bi-directional self-attention architecture in BERT successfully attends to important tokens and captures the semantic relationships between them [33]. Additionally, upgrading to dynamic masking, where different sentence components are masked per epoch, increases robustness and performance, as shown by language models like RoBERTa [34].
Here, “Expando” refers to document passage expansion with generated queries by DocT5Query model trained on MS MARCO passage ranking task. The “Mono” and “Duo” indicate the pointwise and pairwise com- parisons for document ranking. In this work, we investigate the performance and utility of each component with “Expando-Mono-Duo” pattern. The stated ranking models are not specifically pre-trained on comparative argument retrieval questions but rather on general-purpose MS MARCO ranking dataset queries. This leaves these models biased towards making good ranking predictions for certain topic queries, but not for all the topic queries under consideration for this task. We develop our document retrieval and ranking system by empirically improving upon the limits of this design pattern specific to this use case. Stance prediction formulated as an entailment problem has helped in analyzing different problems like political discourse, scientific misinformation, and comparative questions un- derstanding [14, 15, 38]. Additionally, bi-directional self-attention-based models like BERT, RoBERTa and T5 have shown promising results in both regular and zero-shot learning set- tings [15, 13, 39]. Unbalanced stance prediction problems can be broken into two-stage multi- class classification problems in order to improve distinguishing capabilities amongst classes for smaller datasets [13]. For the LeviRANK framework, we investigate this two-step classification approach by first separating object classes with {No, Neutral, Object} as stance prediction objects and then further classify separated objects into {First, Second} classes. Here, the {No, Neutral, Object} labels highlight the absence, neutrality and presence of the object-related information in the relevant documents corresponding to the given topic query. Additionally, the {First, Second} labels highlight if any object in the relevant document is the answer to the given comparative topic query. For the topic queries present in the Argument Retrieval for Comparative Questions shared task, zero-shot learning performance is measured with Macro-F1 metric. But, we also additionally use a 50/50 test/train split on worst and best query topics respectively from the zero-shot learning task to further fine-tune our models and analyze the performance improvement with the Macro-F1 metric. 3. Datasets The dataset of the Touché 2022 Shared Task 2 consists of: (1) a collection of 868,655 passage documents extracted from the ClueWeb12 where the average document length is approximately 150 words and (2) 50 queries on different topics [40]. These documents can either be non-relevant, relevant, or highly relevant for any given topic query. For stance prediction, an additional dataset with 956 comparative questions and answers [38] are provided. It consists of object detection and classification labels generated from annotating data subsets from Stack Exchange and Yahoo Answers topic classification datasets. In this task, we are given a topic query set 𝒬 and a relevant document corpus 𝒟. For each query q ∈ 𝒬, our aim is to retrieve all uniquely relevant documents d ∈ 𝒟 and help categorize them as y(q,d) 𝑖𝑛 {First, Second, No, Neutral} for object stance prediction. Here, y(q,d) represents the oracle stance prediction function given a query q and document d as inputs. The performance of the retrieval subtask is determined by nDCG@5 (i.e., Normalized Discounted Cumulative Gain over top-5 highest scored documents) [41, 42]. 
For the stance prediction problem, the performance is evaluated with the Macro-F1 score, an averaged metric that assigns each class an equal weight.

4. Methodology
We divide the Argument Retrieval for Comparative Questions task into three stages, namely retrieval, ranking, and stance prediction. Figure 1 depicts the complete proposed LeviRANK pipeline, and the following subsections explain the proposed system in detail. To make informed decisions about the individual implementation components for each subtask, we devise our own alternative evaluation strategy, which is elaborated in the subsection below.

4.1. Initial Evaluation Strategy for Subtask Module Selection
We align our evaluation metrics with the three stages of the system, namely retrieval, ranking, and stance prediction. For the retrieval stage, we focus on increasing Recall@K, which measures the fraction of all relevant documents that is retrieved when a given system returns K documents in total. For the ranking stage, we use the nDCG@5 metric, which quantifies what fraction of the top-5 relevant documents is retrieved in the correct order for the given topic queries. Finally, for the stance prediction evaluation, we consider the Macro-F1 score for the entailment task architecture, where Macro-F1 weighs the F1-score obtained for each class equally.

We combine the gold standard labels from the previous two Shared Task 2 iterations that have annotated document relevance ranking information. The LeviRANK system's retrieval and ranking stage performance is evaluated over 100 topic queries from the past two years. Additionally, to match the document identifiers from the gold standard exactly, we use all the unique ChatNoir URLs from this year's corpus to scrape data and build a new corpus in which the average document length increases to approximately 4500 words. Although the document size in the previous task iterations is quite large, we believe evaluating our system on this very large document setup gives us a lower bound on the LeviRANK system's ranking performance, because the maximum input document token lengths of the ranking T5 language architectures are 512 and 256 tokens for monoT5 and duoT5 respectively. Therefore, large web documents of approximately 4500 words get truncated before the majority of argumentation structures can be attended over by self-attention. Finally, for evaluating the stance prediction model we directly use the stance prediction dataset and compare our Macro-F1 results with the existing baseline solution [38].

Table 1
Initial document retrieval performance for comparative query topics with the Recall@K metric, where K = {1000, 1500, 2000} is the number of retrieved documents.

Retrieval Approach          Recall@1000   Recall@1500   Recall@2000
BM25 Baseline               90.18         90.67         91.11
Dense Retrieval             85.70         86.56         87.56
Pseudo-relevance Feedback   89.98         90.59         91.07
LeviRank Voting             90.14         91.08         91.17

4.2. Initial Retrieval
Our retrieval problem mathematically summarizes to gathering relevant documents {d1, ..., dn} ⊆ 𝒟 for topic queries q ∈ 𝒬. First, we perform basic preprocessing of the documents (e.g., lowercasing, removal of punctuation, query tags, and NLTK-based stopwords, and WordNet-based lemmatization) and then build a probabilistic BM25 index for these documents with the Pyserini package [43, 44, 9].
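As a rough illustration of this step, the sketch below builds a BM25 index and runs a single query with Pyserini. It is not our exact code: the index path, corpus location, and BM25 parameters are placeholders, and the indexing command and searcher class name differ slightly between Pyserini versions (newer releases expose the searcher as LuceneSearcher).

# Index the (already DocT5Query-expanded) JSONL passage corpus first, e.g.:
#   python -m pyserini.index -collection JsonCollection \
#       -generator DefaultLuceneDocumentGenerator \
#       -input corpus_jsonl/ -index indexes/touche-bm25 -threads 4 -storeRaw

from pyserini.search import SimpleSearcher

searcher = SimpleSearcher('indexes/touche-bm25')  # hypothetical index path
searcher.set_bm25(k1=0.9, b=0.4)                  # illustrative BM25 parameters

def bm25_retrieve(query: str, k: int = 1000):
    # Return a ranked list of (docid, score) pairs for one topic query.
    hits = searcher.search(query, k=k)
    return [(hit.docid, hit.score) for hit in hits]

top_docs = bm25_retrieve('Which is better, a laptop or a desktop?', k=1000)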
The corpus 𝒟 used for building the index and ranking documents is already expanded with queries generated using DocT5Query. In order to improve Recall@K, we additionally try out query expansion and relevance feedback approaches [45]. We also build a custom dense-representation-based index that utilizes ColBERT representations initially fine-tuned on the MS MARCO dataset [46]. By comparing average Recall@K scores for each retrieval approach, we quantify the retrieval limitations of each of the approaches listed above. With an individual query-level analysis, we additionally observe that different retrieval approaches perform better for different topic queries, and the Recall@K trend across topic queries is not consistent for these retrieval approaches.

For this task, we observe that adding, replacing, or removing even a few words causes a large query drift. The phenomenon is attributed to the small length of the queries, which often contain at most two nouns and one adjective as a comparator. Hence, with our proposed approach we generate nine new queries and fetch their corresponding BM25 retrieval results. These nine queries are expanded in a limited manner that only adds, removes, or replaces one specific word of the original query. Also, we limit the influence of all these queries by only adding a proportion of their individual BM25 retrieval lists, based on their corresponding relevance as determined by us experimentally. In the LeviRANK system, the original query is the main driving query from which the majority of documents are retrieved and added to the final retrieval document set. To remove irrelevant retrieval results from the initial 1000 documents, we further ensure that each of these documents has at least one retrieved copy in the other retrievals from the expanded query pool. This is done by iteratively checking the complete retrieval sets of all the expanded queries for each document belonging to the top-1000 document set of the original query. Essentially, we take votes from the query pool retrieval sets for every document in the initial 1000 documents to obtain the most relevant set of documents. The following document sets are then appended in a cascading manner: (1) two disjoint document sets of the 150 most relevant retrieved documents each from two queries, the first containing only nouns and the second with only stopwords removed; (2) disjoint document sets of the 60 most relevant documents each from three queries in which a synonym replaces the comparative adjective clause; (3) disjoint document sets of the 30 most relevant documents each from the two most relevant antonym queries with the replaced adjective; and (4) until the 2000-document retrieval count is reached, the pseudo-relevance feedback retrieval documents that are not yet part of the extended retrieval set, appended disjointly in equal proportion with the original query's remaining retrieval set. With the union of such a diverse set of disjoint document retrievals, in proportions based on manually assigned query relevance, we intend to make our retrieval result set close to (i.e., more representative of) the ideal retrieval results. Table 1 demonstrates the better performance of our approach, with higher values of Recall@1500 and Recall@2000.
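The following sketch illustrates the voting filter and cascaded merging described above. It is a simplified reconstruction rather than the exact project code: the helper names are our own, each run is assumed to be a pre-computed ranked list of document ids per query, the quotas follow the proportions stated in the text, and the final pseudo-relevance padding is simplified to a single fill step.

def voting_filter(original_run, expanded_runs):
    # Keep documents from the original query's top-1000 run that also appear
    # in at least one run from the expanded query pool.
    support = [set(run) for run in expanded_runs]
    return [d for d in original_run if any(d in s for s in support)]

def cascaded_merge(core_docs, quota_runs, pr_run, target=2000):
    # Append disjoint top documents per (run, quota) pair, then pad with
    # pseudo-relevance feedback results until the target size is reached.
    merged, seen = list(core_docs), set(core_docs)

    def add(run, limit):
        added = 0
        for docid in run:
            if len(merged) >= target or added >= limit:
                break
            if docid not in seen:
                merged.append(docid)
                seen.add(docid)
                added += 1

    for run, quota in quota_runs:   # e.g. [(noun_only, 150), (no_stopwords, 150),
        add(run, quota)             #       (synonym_1, 60), ..., (antonym_1, 30), ...]
    add(pr_run, target)             # pseudo-relevance feedback fills the remainder
    return merged[:target]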
4.3. Document Ranking
In this subtask, the relevant documents 𝒟̂_K(q,d) = {d_1(q,d), ..., d_K(q,d)} are ranked, where each document comprises sentences {s_1(q,d), ..., s_Y(q,d)}. Here, 𝒟̂_K(q,d) represents the documents from the initial retrieval, and s_1(q,d) to s_Y(q,d) denote the sentences indexed from 1 to Y. We use all the top-K (K=2000) retrieved documents for each query from the previous subtask and rank them with monoT5's default implementation from the PyGaggle library without fine-tuning [47]. The language model receives the query-document input sequence and produces binary true/false target tokens for the documents. At inference time, the model outputs a condensed single relevance score for each document by applying a softmax over the true/false label logits. monoT5 thus performs a pointwise comparison between the documents.

Additionally, we use the duoT5 model to further refine the document ranking with pairwise comparisons. For the second re-ranking stage we use the top-k (k=100) relevant documents from the monoT5 stage and rank them again with the duoT5 model. In this re-ranking subtask, the relevant documents 𝒟̂_k(q,d) = {d_1(q,d), ..., d_k(q,d)} are ranked, where each document comprises sentences {s_1(q,d), ..., s_{Y/2}(q,d)}. Here, 𝒟̂_k(q,d) represents the documents after the monoT5 ranking stage, and s_1(q,d) to s_{Y/2}(q,d) are the sentences indexed from 1 to Y/2, i.e., half of the original document length of 512 tokens. The language model receives the query and document-pair input sequence, and its outputs are aggregated into a single relevance score using the Sym-Sum method given in Equation 1. In Equation 1, i represents the given document and s_i is the document's aggregated score, j denotes a document it is compared against, and J_i is the set of documents compared with document i. Finally, p_{i,j} and p_{j,i} represent the pairwise scores for document i compared with document j and vice versa.

Sym-Sum:  s_i = ∑_{j ∈ J_i} ( p_{i,j} + (1 − p_{j,i}) )    (1)

As the results in Table 2 show, the monoT5 model performs best for the document ranking task on the previous years' topic queries with the given merged document corpus; the average nDCG@5 values for BM25, monoT5-only, and monoT5-duoT5 are 0.33, 0.47, and 0.31 respectively. According to our individual query-level analysis based on nDCG@5 scores, the duoT5 model's performance is inconsistent when handling large documents. The input sequence size of the duoT5 model reduces to 256 tokens for both document sequences under analysis, in comparison to the 512 tokens of the monoT5 model, which results in a loss of relevant argumentation information. Additionally, our manual analysis indicates that the starting section of the web documents in the corpus often contains irrelevant information (e.g., links, headers), which further contributes to the performance drop of the duoT5 model for this subtask.

Table 2
Document ranking results of different ranking approaches with their corresponding nDCG@5 scores.

Ranking Approach   BM25   monoT5-only   monoT5-duoT5
nDCG@5             0.33   0.47          0.31

Table 3
Stance prediction F1-score results of the LeviRANK system compared with the most accurate sentiment prompt-based system result from Bondarenko et al. [38].

Approach                 No object   Neutral   Object 1   Object 2   Macro-F1
Bondarenko et al. [38]   0.40        0.53      0.70       0.63       0.57
LeviRank                 0.40        0.52      0.72       0.68       0.58
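A minimal sketch of the two ranking stages with PyGaggle is given below, following the library's documented MonoT5/DuoT5 usage. It is an illustration under our assumptions, not the exact system code: rank_candidates is a hypothetical helper, the candidate list is assumed to come from the initial retrieval as (docid, passage) pairs, and default MS MARCO checkpoints are used.

from pygaggle.rerank.base import Query, Text
from pygaggle.rerank.transformer import MonoT5, DuoT5

def rank_candidates(query_text, candidates, mono_cut=100):
    query = Query(query_text)
    texts = [Text(passage, metadata={'docid': docid}) for docid, passage in candidates]

    # Stage 1: pointwise scoring of all retrieved candidates with monoT5.
    mono = MonoT5()                      # default MS MARCO checkpoint
    scored = mono.rerank(query, texts)
    scored.sort(key=lambda t: t.score, reverse=True)

    # Stage 2: pairwise re-ranking of the top candidates with duoT5
    # (pairwise score aggregation is handled inside the reranker).
    duo = DuoT5()
    reranked = duo.rerank(query, scored[:mono_cut])
    reranked.sort(key=lambda t: t.score, reverse=True)
    return [(t.metadata['docid'], t.score) for t in reranked]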
4.4. Stance Prediction
In this subtask, we use the predicted relevant documents 𝒟̂'_k(q,d) = {d'_1(q,d), ..., d'_m(q,d)} from the ranking stage to predict the object stance ŷ(q,d) of the comparative queries q ∈ 𝒬. Here, 𝒟̂'_k(q,d) represents the documents ranked after the second reranking stage, and ŷ(q,d) and y(q,d) represent the predicted and oracle stance functions for the given query-document pairs. We formulate this subtask as a two-stage binary classification problem: the first classifier separates the documents into the ŷ(q,d) ∈ {No, Neutral} and {Object} labels given the query-document input sequence, and the second classifier then predicts the stance ŷ(q,d) ∈ {First, Second} for the separated documents. The pre-trained RoBERTa-Large-MNLI language model is used both for the Object Separator predictions and for predicting the final Object Stance.

The results in Table 3 demonstrate the advantage of using the two-step binary classification process in the LeviRANK system for small unbalanced datasets, in comparison to traditional four-way multi-class classification. We attribute this performance gain to the better prediction of the Object labels, with a high F1-score and recall of 84.4% and 93.93% respectively. Multi-class classification models perform poorly when predicting undersampled classes due to their label scarcity in the dataset. Random oversampling with replacement of the underrepresented classes, combined with cross-validation, is further implemented to increase the prediction capabilities for the No and Neutral classes. Hence, by using these techniques, LeviRANK's two-step classification approach reduces false positive predictions of the No and Neutral classes against the Object class. Additionally, our proposed approach performs significantly better for the First and Second object labels, which makes it an even better alternative, since for comparative topic queries it is highly important to know which object is being discussed in either a positive or a negative light within the most relevant documents.
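To illustrate how an off-the-shelf MNLI model can be applied in such a two-step zero-shot fashion, a sketch using the Hugging Face zero-shot classification pipeline is shown below. The hypothesis templates, label phrasings, and the predict_stance helper are placeholders of our own and not the exact prompts or code used in LeviRANK.

from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="roberta-large-mnli")

def predict_stance(question, document, obj_first, obj_second):
    # Step 1 (Object Separator): does the document take a stance on the objects?
    step1 = classifier(
        document,
        candidate_labels=["no object", "neutral", "object"],
        hypothesis_template="This text expresses a stance: {}.",
    )
    if step1["labels"][0] != "object":
        return step1["labels"][0]

    # Step 2 (Object Stance): which of the two compared objects is favoured?
    step2 = classifier(
        document,
        candidate_labels=[obj_first, obj_second],
        hypothesis_template="For the question '" + question + "', the better option is {}.",
    )
    return "first" if step2["labels"][0] == obj_first else "second"

print(predict_stance("Which is better, tea or coffee?",
                     "Coffee keeps you alert much longer than tea.", "tea", "coffee"))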
5. Results and Error Analysis
The leaderboard results of the Touché Shared Task 2: Argument Retrieval for Comparative Questions are summarized in Table 4. Our submission set included five systems in total, three of which use the monoT5-duoT5 ranking and reranking architecture, whereas the other two only include the monoT5 ranking architecture. The monoT5-only architecture was selected as a fallback approach because of the unexpected duoT5 ranking results for large documents. The specific module components of each submission and the respective subtasks are indicated in the Submitted Approaches column of Table 4.

Table 4
Submission result summary from the leaderboard of the Touché Shared Task 2: Argument Retrieval for Comparative Questions for the LeviRANK system. The "duoT5 Input" column gives the number of documents passed to the duoT5 re-ranker.

Submitted Approaches           Recall@2K   duoT5 Input   nDCG@5 (rel)   nDCG@5 (qual)
TCT-ColBERT+monoT5+duoT5       92.05       100           0.7581         0.7443
BM25+monoT5+duoT5              98.23       100           0.7552         0.7424
LeviRANK+PR+monoT5+duoT5       97.96       50            0.7533         0.7305
LeviRANK+monoT5                98.34       0             0.7274         0.7066
Pseudo-Relevance(PR)+monoT5    97.16       0             0.7225         0.6957

From Table 4, we observe that the LeviRANK initial retrieval approach achieves the highest Recall@2000 score, as expected. We also infer that the input size to the duoT5 model is one of the major factors driving the highest nDCG@5 scores. Additionally, by comparing the TCT-ColBERT [48] and BM25 based systems, we can observe that a higher Recall@2000 score does not necessarily guarantee the best nDCG@5 results for both the quality and relevance metrics; quality and relevance require separate model architectural design for performance improvement. Even though our models achieve the highest nDCG@5 scores for relevance, they still miss out on producing documents of the highest quality.

For the stance prediction subtask, our approach obtained a Macro-F1 score of 0.301 with the two-step classification approach. This sub-par performance can especially be attributed to the low performance when predicting the No and Neutral classes. Since the evaluation dataset was not available during the development of the stance prediction model, we can interpret this result as zero-shot learning performance obtained after training on the Stack Exchange and Yahoo Answers topic classification datasets. Further, to measure the stance prediction capabilities of our two-step approach, we fine-tune our entailment language models on the annotations corresponding to the top-50% best query topics, i.e., the 25 topics for which our entailment models achieved the best Macro-F1 scores. We then obtain a Macro-F1 score of 0.387 for the unseen annotations of the query topics on which our system's performance was the worst, as shown in Table 5. This improvement in performance can directly be attributed to the addition of new training annotations for the RoBERTa-MNLI language models.

Table 5
Stance prediction Macro-F1 score results of the LeviRANK system measuring (1) zero-shot performance on the complete annotated relevant documents dataset, (2) zero-shot performance on the annotations corresponding to the top-50% worst performing topics, and (3) performance on the annotations corresponding to the top-50% worst performing topics after fine-tuning on the annotations corresponding to the top-50% best performing topics.

Training Approach                                      Prediction Annotations Set   Macro-F1
Zero-shot Two-Step RoBERTa-MNLI architecture           Whole stance dataset         0.3032
Zero-shot Two-Step RoBERTa-MNLI architecture           Worst 50% topic queries      0.1166*
Two-Step RoBERTa-MNLI (fine-tuned, 50% best topics)    Worst 50% topic queries      0.3871*

Figure 2: Qualitative latent representation comparison between retrieved large-size documents for topic queries at the initial retrieval stage, demonstrating the more representationally spread initial retrieval of the LeviRANK system.

Figure 3: Qualitative latent representation comparison between retrieved regular-size documents for topic queries at the initial retrieval stage, again demonstrating the more representationally spread initial retrieval of the LeviRANK system.

As illustrated in Figures 2 and 3 with the t-SNE (t-distributed Stochastic Neighbor Embedding) representation plots, it is encouraging to see that the representational spread of the 2000-document set from our proposed LeviRANK system is far higher compared to the baseline BM25 and relevance-feedback-based retrieval results [49]. t-SNE is concerned with pairwise distances between points and attempts to visualize high-dimensional data in a low-dimensional 2D space; the individual axes in a t-SNE plot have no quantifiable meaning, and these plots are used for qualitative analysis only. Additionally, we observe that the higher latent vector space spread holds for query topics belonging to multiple different topic domains as well. We argue that our proposed methodology successfully retrieves documents that contain a high degree of variance amongst themselves, irrespective of the document size, alongside capturing the most relevant documents. This gives a retrieval performance boost when a very large number of documents is retrieved with our proposed multiple-retrieval voting and merging approach, as evidenced by the highest Recall@2000 scores.
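For reference, a plot of this kind can be produced along the following lines. The encoder choice and plotting details are assumptions for illustration and not our exact setup; any sentence encoder over the retrieved documents would serve the same qualitative purpose.

from sentence_transformers import SentenceTransformer
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_retrieval_spread(runs):
    # runs: dict mapping a run name to a list of retrieved document texts.
    encoder = SentenceTransformer('all-MiniLM-L6-v2')  # placeholder encoder
    fig, axes = plt.subplots(1, len(runs), squeeze=False,
                             figsize=(5 * len(runs), 4))
    for ax, (name, docs) in zip(axes[0], runs.items()):
        embeddings = encoder.encode(docs)
        points = TSNE(n_components=2, perplexity=30, init='pca',
                      random_state=42).fit_transform(embeddings)
        ax.scatter(points[:, 0], points[:, 1], s=4)
        ax.set_title(name)  # the axes themselves carry no quantitative meaning
    plt.savefig('retrieval_spread_tsne.png', dpi=150)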
Figure 4: Latent representation comparison between the documents ranked by the monoT5 and duoT5 models at the large-document ranking stage, demonstrating strong document-distinguishing capabilities amongst the top-ranked documents for different queries.

Figure 5: Latent representation comparison between the documents ranked by the monoT5 and duoT5 models at the regular-document ranking stage, demonstrating stronger document-distinguishing capabilities amongst the top-ranked documents for different queries.

When considering experiments with large documents, inconsistent behavior is observed for the duoT5 model: for some queries it performs substantially well, but its ranking capabilities suffer in general, leading to a lower nDCG@5 score than monoT5. For regular-size document sequences, as shown in Table 4, this issue does not reproduce itself. As depicted in Figures 4 and 5, we analyze the ranking behavior of monoT5 and duoT5 on groups of similar and dissimilar queries with respect to the top-75 retrieved document sets. From the ranking t-SNE plots of both models on similar and dissimilar queries, it is clear that the top-ranked documents have a more separated cluster structure for the monoT5 model as compared to duoT5 in Figure 4, and vice versa in Figure 5. We further argue that for large documents, the superior distinguishing capabilities of the monoT5 model amongst the top-ranked documents are, in general, the reason for these relatively disjoint clusters and its better performance. Specifically, since the retrieval corpus of most relevant documents is the same for both models, duoT5 is biased towards selecting particular sets of large documents, leading to a reduced ability to produce disjoint clusters for different topic queries. This performance analysis is strictly limited to the large-document scenario, since we find the opposite result in Figure 5, where the duoT5 model produces more distinct clusters for both similar and dissimilar queries, demonstrating its superior distinguishing capabilities there.

6. Conclusion and Future Work
In this work, we propose the LeviRANK system, which uses a multi-stage reranking architecture to rank relevant documents for comparative questions. For this system, we implement a novel retrieval approach that systematically merges retrieval results from a restricted query pool based on voting. Additionally, retrieval results are appended in a cascading manner, where the appended retrieval result size depends on the relevance assigned to the query. This retrieval approach also attempts to find synergy amongst multiple retrieval techniques like relevance feedback, query expansion, and DocT5Query for improving the Recall@2000 values. The cascading retrieval merging approach achieves the highest Recall@2000 values of 91.17 and 98.42 for the combined previous two years' and the current Touché Shared Task 2's comparative topic queries respectively. We further investigate the performance of the "Expando-Mono-Duo" design pattern for these ad-hoc retrievals. For Touché Task 2: Argument Retrieval for Comparative Questions, our ranking pipeline obtains the best performance of 0.758 for the nDCG@5 metric in the relevance evaluation task and the second-best performance of 0.744 for the nDCG@5 metric in the quality evaluation task.
With these results, we conclude that bi-directional self-attention models successfully capture comparative argumentative structure for the given topic queries, especially for medium-length documents in a pairwise document comparison setting. Further, we observe that our system suffers in ranking performance when the document size becomes large, especially for the duoT5 model. This performance decrease is attributed to a lack of argumentation structure being present within the maximum input token length limit of these T5 language model architectures. For the stance prediction task, our model achieves a Macro-F1 score of 0.301, which is lower than the Macro-F1 obtained on the dev-set of the stance prediction dataset. This decrease in performance can be attributed to the especially low prediction performance on the No and Neutral labels. In summary, the LeviRANK system provides the best relevant document results out of all the existing systems and further gives competitive performance when predicting the stance of the retrieved documents. For future work, we intend to systematically study the causes of the inconsistencies in document relevance and quality ranking results amongst different query topics and to produce more accurate and consistent ranking systems.

Acknowledgments
We would like to extend our gratitude towards our professor Dr. Simone Paolo Ponzetto and our tutor Chia-Chien Hung for sharing their valuable experience and insights with us. We also appreciate the efforts of the Touché team in providing all the extended support and help.

References
[1] E. Turner, L. Rainie, Most Americans rely on their own research to make big decisions, and that often means online searches (2020).
[2] A. Bondarenko, P. Braslavski, M. Völske, R. Aly, M. Fröbe, A. Panchenko, C. Biemann, B. Stein, M. Hagen, Comparative web search questions, in: Proceedings of the 13th International Conference on Web Search and Data Mining, 2020, pp. 52–60.
[3] H. Trivedi, H. Kwon, T. Khot, A. Sabharwal, N. Balasubramanian, Repurposing entailment for multi-hop question answering tasks, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 2948–2958. URL: https://aclanthology.org/N19-1302. doi:10.18653/v1/N19-1302.
[4] J. Lawrence, C. Reed, Argument mining: A survey, Computational Linguistics 45 (2020) 765–818.
[5] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, Advances in Neural Information Processing Systems 30 (2017).
[6] R. Pradeep, R. Nogueira, J. Lin, The expando-mono-duo design pattern for text ranking with pretrained sequence-to-sequence models, 2021. URL: https://arxiv.org/abs/2101.05667. doi:10.48550/ARXIV.2101.05667.
[7] R. Nogueira, J. Lin, A. Epistemic, From doc2query to docTTTTTquery, Online preprint 6 (2019).
[8] M. Potthast, T. Gollub, M. Wiegmann, B. Stein, TIRA Integrated Research Architecture, in: N. Ferro, C. Peters (Eds.), Information Retrieval Evaluation in a Changing World, The Information Retrieval Series, Springer, Berlin Heidelberg New York, 2019. doi:10.1007/978-3-030-22948-1_5.
[9] J. Lin, X. Ma, S.-C. Lin, J.-H. Yang, R. Pradeep, R.
Nogueira, Pyserini: A Python toolkit for reproducible information retrieval research with sparse and dense representations, in: Proceedings of the 44th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2021), 2021, pp. 2356–2362. [10] K. S. Jones, S. Walker, S. E. Robertson, A probabilistic model of information retrieval: development and comparative experiments: Part 2, Information processing & management 36 (2000) 809–840. [11] S. Robertson, H. Zaragoza, et al., The probabilistic relevance framework: Bm25 and beyond, Foundations and Trends® in Information Retrieval 3 (2009) 333–389. [12] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the limits of transfer learning with a unified text-to-text transformer, arXiv preprint arXiv:1910.10683 (2019). [13] X. Zeng, A. Zubiaga, Qmul-sds at sciver: Step-by-step binary classification for scientific claim verification, arXiv preprint arXiv:2104.11572 (2021). [14] J. Thorne, A. Vlachos, C. Christodoulopoulos, A. Mittal, FEVER: a large-scale dataset for fact extraction and VERification, in: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), Association for Computational Linguistics, New Orleans, Louisiana, 2018, pp. 809–819. URL: https://aclanthology.org/N18-1074. doi:10. 18653/v1/N18-1074. [15] D. Wadden, S. Lin, K. Lo, L. L. Wang, M. van Zuylen, A. Cohan, H. Hajishirzi, Fact or fiction: Verifying scientific claims, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Online, 2020, pp. 7534–7550. URL: https://aclanthology.org/2020.emnlp-main.609. doi:10. 18653/v1/2020.emnlp-main.609. [16] S. MacAvaney, A. Yates, A. Cohan, N. Goharian, Cedr: Contextualized embeddings for document ranking, in: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2019, pp. 1101–1104. [17] R. Nogueira, Z. Jiang, J. Lin, Document ranking with a pretrained sequence-to-sequence model, arXiv preprint arXiv:2003.06713 (2020). [18] A. Bondarenko, M. Fröbe, M. Beloucif, L. Gienapp, Y. Ajjour, A. Panchenko, C. Biemann, B. Stein, H. Wachsmuth, M. Potthast, M. Hagen, Overview of Touché 2020: Argument Retrieval, in: A. Arampatzis, E. Kanoulas, T. Tsikrika, S. Vrochidis, H. Joho, C. Lioma, C. Eickhoff, A. Névéol, L. Cappellato, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction. 11th International Conference of the CLEF Association (CLEF 2020), volume 12260 of Lecture Notes in Computer Science, Springer, Berlin Hei- delberg New York, 2020, pp. 384–395. URL: https://link.springer.com/chapter/10.1007/ 978-3-030-58219-7_26. doi:10.1007/978-3-030-58219-7\_26. [19] A. Bondarenko, L. Gienapp, M. Fröbe, M. Beloucif, Y. Ajjour, A. Panchenko, C. Biemann, B. Stein, H. Wachsmuth, M. Potthast, M. Hagen, Overview of Touché 2021: Argument Retrieval, in: K. Candan, B. Ionescu, L. Goeuriot, H. Müller, A. Joly, M. Maistro, F. Piroi, G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction. 12th International Conference of the CLEF Association (CLEF 2021), volume 12880 of Lecture Notes in Computer Science, Springer, Berlin Heidelberg New York, 2021, pp. 450–467. URL: https://link.springer.com/chapter/10.1007/978-3-030-85251-1_28. doi:10. 
1007/978-3-030-85251-1\_28. [20] J. Bevendorff, B. Stein, M. Hagen, M. Potthast, Elastic chatnoir: Search engine for the clueweb and the common crawl, in: European Conference on Information Retrieval, Springer, 2018, pp. 820–824. [21] V. Chekalina, A. Panchenko, Retrieving comparative arguments using ensemble methods and neural information retrieval, Working Notes of CLEF (2021). [22] T. Chen, C. Guestrin, Xgboost: A scalable tree boosting system, in: Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining, 2016, pp. 785–794. [23] T. Abye, T. Sager, A. J. Triebel, An open-domain web search engine for answering compar- ative questions., in: CLEF (Working Notes), 2020. [24] G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, T.-Y. Liu, Lightgbm: A highly efficient gradient boosting decision tree, Advances in neural information processing systems 30 (2017). [25] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word representations in vector space, arXiv preprint arXiv:1301.3781 (2013). [26] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al., Language models are unsupervised multitask learners, OpenAI blog 1 (2019) 9. [27] T. K. H. Luu, J.-N. Weder, Argument retrieval for comparative questions based on indepen- dent features (2021). [28] A. Alhamzeh, M. Bouhaouel, E. Egyed-Zsigmond, J. Mitrović, Distilbert-based argumenta- tion retrieval for answering comparative questions, Working Notes of CLEF (2021). [29] V. Sanh, L. Debut, J. Chaumond, T. Wolf, Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter, arXiv preprint arXiv:1910.01108 (2019). [30] N. Asadi, J. Lin, Effectiveness/efficiency tradeoffs for candidate generation in multi-stage retrieval architectures, in: Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval, 2013, pp. 997–1000. [31] R.-C. Chen, L. Gallagher, R. Blanco, J. S. Culpepper, Efficient cost-aware cascade ranking in multi-stage retrieval, in: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2017, pp. 445–454. [32] S. Liu, F. Xiao, W. Ou, L. Si, Cascade ranking for operational e-commerce search, in: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2017, pp. 1557–1565. [33] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018). [34] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, Roberta: A robustly optimized bert pretraining approach, arXiv preprint arXiv:1907.11692 (2019). [35] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, et al., Exploring the limits of transfer learning with a unified text-to-text transformer., J. Mach. Learn. Res. 21 (2020) 1–67. [36] R. Nogueira, W. Yang, K. Cho, J. Lin, Multi-stage document ranking with bert, arXiv preprint arXiv:1910.14424 (2019). [37] T. Nguyen, M. Rosenberg, X. Song, J. Gao, S. Tiwary, R. Majumder, L. Deng, Ms marco: A human generated machine reading comprehension dataset, in: CoCo@ NIPS, 2016. [38] A. Bondarenko, Y. Ajjour, V. Dittmar, N. Homann, P. Braslavski, M. Hagen, Towards Understanding and Answering Comparative Questions, in: K. S. Candan, H. Liu, L. Akoglu, X. L. Dong, J. 
Tang (Eds.), 15th ACM International Conference on Web Search and Data Mining (WSDM 2022), ACM, 2022, pp. 66–74. URL: https://dl.acm.org/doi/10.1145/3488560. 3498534. doi:10.1145/3488560.3498534. [39] R. Pradeep, X. Ma, R. Nogueira, J. Lin, Scientific claim verification with VerT5erini, in: Proceedings of the 12th International Workshop on Health Text Mining and Information Analysis, Association for Computational Linguistics, online, 2021, pp. 94–103. URL: https: //aclanthology.org/2021.louhi-1.11. [40] A. Bondarenko, M. Fröbe, J. Kiesel, S. Syed, T. Gurcke, M. Beloucif, A. Panchenko, C. Bie- mann, B. Stein, H. Wachsmuth, M. Potthast, M. Hagen, Overview of Touché 2022: Argu- ment Retrieval, in: Experimental IR Meets Multilinguality, Multimodality, and Interaction. 13th International Conference of the CLEF Association (CLEF 2022), Lecture Notes in Computer Science, Springer, Berlin Heidelberg New York, 2022, p. to appear. [41] W. B. Croft, D. Metzler, T. Strohman, Search engines: Information retrieval in practice, volume 520, Addison-Wesley Reading, 2010. [42] C. Manning, P. Raghavan, H. Schütze, Introduction to information retrieval, Natural Language Engineering 16 (2010) 100–103. [43] S. Bird, Nltk: the natural language toolkit, in: Proceedings of the COLING/ACL 2006 Interactive Presentation Sessions, 2006, pp. 69–72. [44] G. A. Miller, Wordnet: a lexical database for english, Communications of the ACM 38 (1995) 39–41. [45] W. B. Croft, S. Cronen-Townsend, V. Lavrenko, Relevance feedback and personalization: A language modeling perspective., in: DELOS, Citeseer, 2001. [46] O. Khattab, M. Zaharia, Colbert: Efficient and effective passage search via contextualized late interaction over bert, in: Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval, 2020, pp. 39–48. [47] R. Pradeep, R. Nogueira, J. Lin, Pygaggle, 2021. URL: https://github.com/castorini/pygaggle. [48] S.-C. Lin, J.-H. Yang, J. Lin, Distilling dense representations for ranking using tightly- coupled teachers, arXiv preprint arXiv:2010.11386 (2020). [49] L. Van der Maaten, G. Hinton, Visualizing data using t-sne, Journal of machine learning research 9 (2008).