Retrieving Comparative Arguments using Deep Language Models
Notebook for the Touché Lab on Argument Retrieval at CLEF 2022

Viktoriia Chekalina, Alexander Panchenko
Skolkovo Institute of Science and Technology

Abstract
In this paper, we present a submission to the Touché lab's Task 2 on Argument Retrieval for Comparative Questions. Our team Katana employs approaches based on the pre-trained deep language model architecture ColBERT [1]. This BERT-based architecture is adapted to the text ranking task by learning to represent both queries and documents as vectors and measuring the similarity between them. We use a model trained on the question-answering dataset MSMARCO, both with the published weights and with weights pre-trained by us. We also customize ColBERT for the comparative retrieval domain by fine-tuning the model on data from the previous years' Touché competitions. The proposed experiments verify the usefulness of transfer learning from large pre-trained ranking language models for the problem of argument extraction for comparative topics. Our solutions rank third in the relevance, quality, and stance prediction evaluations.

Keywords
comparative argument retrieval, natural language processing, neural information retrieval

1. Introduction

In everyday life, people are constantly faced with the task of comparing two options: which phone model is more reliable, which fuel is more environment-friendly, which drug is the most effective. The decision-making process is based not only on comparing the structural features of objects, as suggested, for example, by WolframAlpha (https://www.wolframalpha.com) or Diffen (https://www.diffen.com), but also on considering people's opinions. The problem of searching the web for documents that provide argumentative support for compared objects is a subset of information retrieval. The Touché lab's Task 2 on Argument Retrieval in 2022 [2] asks participants to select the passages from a corpus of about one million texts that are most relevant to a user's comparative query, and to determine their stance, i.e., which of the compared objects the text proposes as the more suitable one. We employ a neural-network-based approach with a simplified scheme for comparing query and document embeddings. In addition to using the pre-trained large language model, we further train the model on documents ranked for comparative queries. On the validation dataset, the approach shows competitive performance, although lower than the ensemble-based method from the previous year [3]. This work shows the feasibility and efficiency of a neural network technique based on matching query and document representations for the specific comparative case of information retrieval.

2. Related work

The work most relevant to ours is the previous shared tasks Touché 2020 [4] and Touché 2021 [5]. These tasks aimed to rerank documents retrieved by the ChatNoir [6] system as candidate answers to comparative requests. Multiple teams submitted their runs to the shared tasks, as presented in the CLEF technical reports (http://ceur-ws.org/Vol-2696, http://ceur-ws.org/Vol-2936). The main difficulty in finding relevant documents on the web is the large size of the text corpus.
Traditionally, search engines represent documents using statistics-based features that are cheap to compute. For example, the baseline of the 2020 and 2021 comparative shared tasks is built on BM25F [7], a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each passage. This baseline performed well: only a few teams in the previous years' competitions managed to beat it. One of the best-performing solutions was decision tree ensembles over statistical and comparative features [3], implemented with the PyTerrier [8] library.

The large volume of texts limits the use of neural networks for ranking documents in a corpus. There are two families of neural approaches to information retrieval: representation-based models [9] and interaction-based models [10]. The former compute the representations of the topic and the passage separately and only then compute an interaction score for the pair. Interaction-based methods match the query and document at the token or phrase level; these methods are more expensive but more effective. In this paper we deploy an architecture that combines the advantages of both families.

3. Data and experimental design

3.1. Data provided for the task

The organizers offer the participants 50 comparative questions (topics), for which passages have to be extracted from the text corpus and ranked. The topics are available online (https://webis.de/events/touche-22/shared-task-2.html). The organizers also provide a corpus of about 0.9 million texts for passage extraction. To support stance detection, every topic specifies the objects that are compared in it, and a dataset built from the comparative questions of MSMARCO (https://microsoft.github.io/msmarco) is provided. This dataset includes relevant answers with the compared objects highlighted and their positions in the documents; every text in the dataset has an annotated stance.

For model validation purposes, the task provides 100 topics and the corresponding relevance annotations from the previous years' competitions [4, 5]. These documents were also retrieved with ChatNoir and manually assigned scores of 0 (not relevant), 1 (relevant), or 2 (highly relevant). The 2020 assessment contains a single overall judgment, while last year's competition has separate judgments for relevance and quality. We use this data to fine-tune the model for the comparative sub-task of document retrieval. In addition, last year's team submissions are available as well.

3.2. Datasets

The standard training instance for argument ranking is a triple: a query, a positive passage (relevant text), and a negative passage (irrelevant text). The reading comprehension dataset MSMARCO (Microsoft Machine Reading Comprehension) [11] includes 1,010,916 anonymized questions from Bing's query logs and 8 million passages extracted from the Bing search engine. For training the BERT-based model we use MSMARCO-Passage-Ranking, which comprises triples built from these questions and passages.

We use data from the previous years' Touché tasks to build a validation dataset and a dataset for fine-tuning the ColBERT model. For every topic, we retrieve up to 100 texts from the ClueWeb12 corpus (http://lemurproject.org/clueweb12) using the ChatNoir [6] system, following the Touché'20-21 task rules. The validation dataset was created from 10 topics of 2021 with the corresponding quality and relevance qrels.
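As an illustration of this retrieval step, the sketch below queries ChatNoir for one topic. It assumes the public ChatNoir REST API at https://www.chatnoir.eu/api/v1/_search with apikey, query, index, and size parameters; the parameter names and the helper function are illustrative, not the exact code we ran.

```python
# Minimal sketch of candidate retrieval via the public ChatNoir REST API.
# The endpoint and parameter names follow the ChatNoir documentation but should
# be treated as assumptions here; the helper itself is illustrative.
import requests

CHATNOIR_URL = "https://www.chatnoir.eu/api/v1/_search"

def retrieve_candidates(topic: str, api_key: str, k: int = 100) -> list:
    """Return up to k ClueWeb12 hits (ids, titles, snippets) for one comparative topic."""
    response = requests.post(CHATNOIR_URL, json={
        "apikey": api_key,
        "query": topic,      # e.g. "Which is better, a laptop or a desktop?"
        "index": ["cw12"],   # search only the ClueWeb12 corpus
        "size": k,
    })
    response.raise_for_status()
    return response.json().get("results", [])
```

The returned hits are then matched against the Touché qrels to form positive and negative training examples, as described next.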
The remaining 40 topics, together with the 50 topics from 2020, produce the data for adapting the pre-trained model to ranking texts with respect to argumentative object comparison. The 2020 topics have only one assessment dimension in the qrels: if the score is 1 or 2, we treat the text as relevant. Negative examples were selected from documents with a score below 1, or from the search results of other topics, provided that they did not appear in the search results for the current query. The 2021 assessment has separate judgments along two axes, quality and relevance. We sum the quality and relevance scores and consider documents with a total score of 3 or more as relevant, and the rest as irrelevant. The statistics of these datasets are given in Table 1.

Table 1
Statistics of the datasets used for training from scratch and for fine-tuning.

Dataset                        Task       Number of triples
MSMARCO-Passage-Ranking        train      39 780 810
Dataset based on Touché 2021   fine-tune  46 450

3.3. Evaluation setup

For document ranking, we use the ColBERT [1] model, pre-trained in several ways. At the test stage, we use the model to build an index of all documents in the provided collection of text passages; using this index, we select the top-k texts most relevant to each topic. We use the auxiliary information about the compared objects to locate them in every ranked document and determine the document stance with the Comparative Argumentative Machine (CAM) [12]. We execute the produced solutions on the web evaluation platform TIRA [13]. The retrieved documents are assessed manually along two dimensions: general relevance and comparison quality. Relevance reflects proximity to the topic and the presence of sufficient argumentative support; quality refers to good structure, comprehensibility, and text style.

Figure 1: The Late Interaction matching scheme used in the ColBERT architecture. The similarity of a query and a document is the sum of the scores between every query token and its most similar document token. Source of the image: [1].

In the validation phase, we use the topics of the previous years' competitions as queries. The corpus on which the model builds the index consists of the documents returned by ChatNoir that are relevant to these topics. We retrieve documents for every question and compare the ranking with the official qrels judgments.

4. Document ranking

4.1. Document ranking with Late Interaction over BERT representations

The main architecture we use for the document retrieval task is Contextualized Late Interaction over BERT (ColBERT). ColBERT provides a trade-off between representation-based models, which have low computational cost, and well-performing token-level interaction-based models. Compared to approaches that build a full interaction matrix between query and document tokens, ColBERT reduces complexity by applying a convolution over the document tokens. The query and document processing in the ColBERT architecture proceeds as follows:

• To encode a query, we insert a [Q] token after the [CLS] token, process the padded query with BERT, and apply a convolution and normalization.
• To encode a document, we insert a [D] token after the [CLS] token, process the padded passage with BERT, apply a convolution and normalization, and filter out punctuation and other tokens that are unimportant for the retrieval task.
• Following the Late Interaction concept (Fig. 1), for each query token only the document token with the maximum similarity is considered, and the document relevance is estimated as the sum of these maximum similarities across all query tokens (see the sketch after this list).
• For retrieval over a large-scale set of passages, the faiss library [14] for efficient similarity search is used.
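To make the Late Interaction scoring concrete, here is a minimal sketch (not the authors' code; tensor names are illustrative) that computes the MaxSim relevance of one document for one query from already normalized token embeddings:

```python
import torch

def colbert_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    """Late Interaction (MaxSim) score for one query/document pair.

    query_emb: (num_query_tokens, dim) L2-normalized token embeddings
    doc_emb:   (num_doc_tokens, dim)   L2-normalized token embeddings
    """
    # Cosine similarity between every query token and every document token.
    sim = query_emb @ doc_emb.T                  # (num_query_tokens, num_doc_tokens)
    # Keep, for each query token, only its best-matching document token ...
    max_per_query_token = sim.max(dim=1).values
    # ... and sum these maxima to obtain the document relevance score.
    return max_per_query_token.sum()
```

For the model we trained from scratch, which uses the squared L2 distance instead of cosine similarity (Section 4.2), sim would instead hold the negative squared Euclidean distances between token pairs.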
Thus, the ColBERT approach fine-tunes the main BERT encoder and learns from scratch the linear layers, the filter, and the embeddings for the [Q] and [D] symbols. Using triples ⟨q, d+, d−⟩ of a query, a document with high relevance, and a document with low relevance, the model optimizes a pairwise softmax cross-entropy loss.

4.2. ColBERT models

For passage retrieval in the Touché task, we use three differently pre-trained variants of the ColBERT architecture.

ColBERT original. The first is a checkpoint generated at the University of Glasgow (http://www.dcs.gla.ac.uk/~craigm/colbert.dnn.zip) on MSMARCO triples following the instructions from the official ColBERT repository (https://github.com/stanford-futuredata/ColBERT).

ColBERT from scratch. We also pre-trained the ColBERT architecture provided in the repository from scratch ourselves. We use the L2 distance between query and document instead of cosine similarity, since the original paper notes that the faiss index works faster with squared distances. Training was carried out for 3 epochs with a learning rate of 3e-6, batch size 64, passage length of at most 180 tokens, query length 32, and L2 similarity, and took about two weeks on a single GPU card.

ColBERT fine-tune. We also fine-tuned the resulting model on the data for a comparative question-answering system obtained from the past competitions and described in Section 3.2. The fine-tuning procedure used the following parameters: learning rate 1e-7, batch size 64, passage length of at most 180 tokens, query length 32, L2 similarity. The weights are updated with the AdamW optimizer over 10 epochs.

5. Stance detection

An additional challenge within the task was to determine the stance of the retrieved documents. Stance defines the document's attitude towards the compared objects: pro first object, pro second object, neutral, or no stance. To detect the stance of a given document, we take the objects from the topic's auxiliary data, find them in the document, and consider the text between the objects' locations. The Comparative Argumentative Machine (CAM) offers the possibility of classifying these pieces of text: it encodes them into feature vectors using InferSent [15] and applies a pre-trained XGBoost classifier to the features [12]. The output of CAM is taken as the document stance class.
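The span-extraction step can be pictured as in the sketch below. This is a simplified reading of the procedure, and classify_with_cam is a hypothetical stand-in for CAM's InferSent + XGBoost pipeline, not a real API.

```python
from typing import Optional

def extract_comparison_span(text: str, obj_a: str, obj_b: str) -> Optional[str]:
    """Return the text between (and including) the first mentions of the two
    compared objects, or None if either object is absent from the document."""
    lower = text.lower()
    i, j = lower.find(obj_a.lower()), lower.find(obj_b.lower())
    if i == -1 or j == -1:
        return None
    start = min(i, j)
    end = max(i + len(obj_a), j + len(obj_b))
    return text[start:end]

# Hypothetical stand-in for CAM's InferSent encoding + XGBoost classification.
def classify_with_cam(span: str) -> str:
    raise NotImplementedError  # would return e.g. "PRO_FIRST", "PRO_SECOND", "NEUTRAL"

# Usage (illustrative object names):
# span = extract_comparison_span(passage_text, "laptop", "desktop")
# stance = classify_with_cam(span) if span else "NO_STANCE"
```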
6. Results

We run the proposed approaches in two stages: in the validation stage, the model retrieves and ranks documents for the previous years' topics over the ChatNoir output; in the test stage, the model ranks passages for the given topics over the provided corpus and at the same time determines their stance.

6.1. Results on validation set

Table 2
NDCG@5 results for quality and relevance of the retrieved documents on the validation set.

Method                  Quality   Relevance
Baseline'21             0.427     0.649
Best Answer'21          0.421     0.591
ColBERT original        0.413     0.474
ColBERT from scratch    0.342     0.314
ColBERT fine-tune       0.322     0.365

The results of each proposed approach on the validation part of the data from the previous year's competition are given in Table 2. We compare the ColBERT-based approaches to the previous year's baseline and to the LGBM Ranker, considered the best submission. The best scores come from the frequency-based feature baseline, and second place belongs to the ensembles over statistical and comparative features.

Pre-trained ColBERT yields slightly worse results in terms of quality. In terms of relevance, the decrease is more significant, but of the same order as the gap between the first- and second-place scores. ColBERT trained by our team from scratch performs worse than the pre-trained ColBERT. Fine-tuning this version on the dataset from the previous years' tasks gives a noticeable increase in relevance but makes the model perform slightly worse on quality. This may be due to the properties of the Touché-based dataset used for fine-tuning: it contains passages that are less complete and grammatically correct than the MSMARCO ones, but at the same time more suitable specifically for the comparative subset of questions.

6.2. Results on test set

The retrieved documents were assessed manually along two dimensions. The first criterion is relevance, i.e., how apt and well-supported an answer is contained in the passage; the second is rhetorical quality, i.e., good style and comprehensibility of the text. The results also contain the macro F1 classification scores for stance detection. The results on the three criteria for our team Katana and for the Top-1 approach in each metric are given in Table 3. For the document ranking task, ColBERT trained on the MSMARCO dataset performs better than the fine-tuned model. The difference between the model with downloaded weights and the model trained by us from scratch is not significant. The pre-trained model achieves 3rd place in terms of relevance, while the model trained from scratch takes 3rd place in the quality table. Fine-tuning on comparative data impairs the results. This may be due to the quality difference between the texts of the main and the fine-tuning data: in the MSMARCO case, well-formed natural-language passages were composed by humans on the basis of the search system outputs [11]. The quality of stance detection towards the objects expectedly depends on the ranking performance: the ColBERT with pre-trained weights also takes third place.

Table 3
Final evaluation scores on the test set for the Katana team as compared to the Top-1 approaches.

Method                  NDCG@5 relevance   NDCG@5 quality   F1 stance detection
ColBERT original        0.618 (Top-3)      0.643            0.229 (Top-3)
ColBERT from scratch    0.601              0.644 (Top-3)    0.221
ColBERT fine-tune       0.574              0.637            0.212
Top-1 approach          0.758              0.774            0.313
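For reference, the nDCG@5 metric reported in Tables 2 and 3 can be computed from the graded judgments (0, 1, 2) as in the simplified sketch below; it uses the linear-gain formulation and treats the judged pool as the ideal ranking, which may differ in details from the official evaluation scripts.

```python
import math

def dcg_at_k(gains, k=5):
    """Discounted cumulative gain over the first k graded judgments (0, 1 or 2)."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))

def ndcg_at_k(system_gains, judged_gains, k=5):
    """nDCG@k: DCG of the system ranking divided by the DCG of an ideal ordering."""
    ideal = dcg_at_k(sorted(judged_gains, reverse=True), k)
    return dcg_at_k(system_gains, k) / ideal if ideal > 0 else 0.0

# Example: a run that puts a highly relevant document first but ranks a
# non-relevant one second (hypothetical judgments).
print(round(ndcg_at_k([2, 0, 1, 1, 0], [2, 2, 1, 1, 1, 0, 0]), 3))
```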
7. Conclusion

We present our solution for Argument Retrieval for Comparative Questions, a ranking task over a corpus of textual passages. In our submission, we use large pre-trained neural models that match representations of an input text document and a query. More specifically, we experiment with the ColBERT model, based on the computationally efficient late interaction architecture. We employ a model pre-trained on the question-answering dataset MSMARCO. To adapt the model to the particular comparative case, we fine-tune it on a dataset built from the ranked documents of the previous years' competitions. We also detect the stance of every ranked text using the classification functionality of the Comparative Argumentative Machine, which determines the polarity of the text between two objects with a pre-trained InferSent model. According to the manual assessment, the best quality in all metrics, both ranking and stance detection, comes from the model trained on the large MSMARCO dataset, showing that the pre-trained model already answers comparative questions decently and turns out to be a strong baseline. A straightforward fine-tuning of this model on comparative questions worsens the ultimate quality of the model. The source code of our experiments is available online (https://github.com/sayankotor/touche).

Acknowledgments

We thank Maik Fröbe for providing support for the software runs in the TIRA system.

References

[1] O. Khattab, M. Zaharia, ColBERT: Efficient and effective passage search via contextualized late interaction over BERT, CoRR abs/2004.12832 (2020). URL: https://arxiv.org/abs/2004.12832. arXiv:2004.12832.
[2] A. Bondarenko, M. Fröbe, J. Kiesel, S. Syed, T. Gurcke, M. Beloucif, A. Panchenko, C. Biemann, B. Stein, H. Wachsmuth, M. Potthast, M. Hagen, Overview of Touché 2022: Argument Retrieval, in: Experimental IR Meets Multilinguality, Multimodality, and Interaction. 13th International Conference of the CLEF Association (CLEF 2022), Lecture Notes in Computer Science, Springer, Berlin Heidelberg New York, 2022, p. to appear.
[3] V. Chekalina, A. Panchenko, Retrieving comparative arguments using ensemble methods and BERT, in: G. Faggioli, N. Ferro, A. Joly, M. Maistro, F. Piroi (Eds.), Proceedings of the Working Notes of CLEF 2021 - Conference and Labs of the Evaluation Forum, Bucharest, Romania, September 21st-24th, 2021, volume 2936 of CEUR Workshop Proceedings, CEUR-WS.org, 2021, pp. 2354–2365. URL: http://ceur-ws.org/Vol-2936/paper-211.pdf.
[4] A. Bondarenko, M. Fröbe, M. Beloucif, L. Gienapp, Y. Ajjour, A. Panchenko, C. Biemann, B. Stein, H. Wachsmuth, M. Potthast, M. Hagen, Overview of Touché 2020: Argument Retrieval, 2020, pp. 384–395. doi:10.1007/978-3-030-58219-7_26.
[5] A. Bondarenko, L. Gienapp, M. Fröbe, M. Beloucif, Y. Ajjour, A. Panchenko, C. Biemann, B. Stein, H. Wachsmuth, M. Potthast, M. Hagen, Overview of Touché 2021: Argument retrieval - extended abstract, in: D. Hiemstra, M. Moens, J. Mothe, R. Perego, M. Potthast, F. Sebastiani (Eds.), Advances in Information Retrieval - 43rd European Conference on IR Research, ECIR 2021, Virtual Event, March 28 - April 1, 2021, Proceedings, Part II, volume 12657 of Lecture Notes in Computer Science, Springer, 2021, pp. 574–582. URL: https://doi.org/10.1007/978-3-030-72240-1_67. doi:10.1007/978-3-030-72240-1_67.
[6] M. Potthast, M. Hagen, B. Stein, J. Graßegger, M. Michel, M. Tippmann, C. Welsch, ChatNoir: A Search Engine for the ClueWeb09 Corpus, in: B. Hersh, J. Callan, Y. Maarek, M. Sanderson (Eds.), 35th International ACM Conference on Research and Development in Information Retrieval (SIGIR 2012), ACM, 2012, p. 1004. doi:10.1145/2348283.2348429.
[7] S. E. Robertson, H. Zaragoza, M. J. Taylor, Simple BM25 extension to multiple weighted fields, in: D. A. Grossman, L. Gravano, C. Zhai, O. Herzog, D. A. Evans (Eds.), Proceedings of the 2004 ACM CIKM International Conference on Information and Knowledge Management, Washington, DC, USA, November 8-13, 2004, ACM, 2004, pp. 42–49. URL: https://doi.org/10.1145/1031171.1031181. doi:10.1145/1031171.1031181.
[8] C. Macdonald, N. Tonellotto, Declarative Experimentation in Information Retrieval using PyTerrier, in: K. Balog, V. Setty, C. Lioma, Y. Liu, M. Zhang, K. Berberich (Eds.), ICTIR '20: The 2020 ACM SIGIR International Conference on the Theory of Information Retrieval, Virtual Event, Norway, September 14-17, 2020, ACM, 2020, pp. 161–168. URL: https://dl.acm.org/doi/10.1145/3409256.3409829.
[9] P.-S. Huang, X. He, J. Gao, L. Deng, A. Acero, L. Heck, Learning deep structured semantic models for web search using clickthrough data, in: Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, CIKM '13, Association for Computing Machinery, New York, NY, USA, 2013, pp. 2333–2338. URL: https://doi.org/10.1145/2505515.2505665. doi:10.1145/2505515.2505665.
[10] B. Mitra, F. Diaz, N. Craswell, Learning to match using local and distributed representations of text for web search, in: R. Barrett, R. Cummings, E. Agichtein, E. Gabrilovich (Eds.), Proceedings of the 26th International Conference on World Wide Web, WWW 2017, Perth, Australia, April 3-7, 2017, ACM, 2017, pp. 1291–1299. URL: https://doi.org/10.1145/3038912.3052579. doi:10.1145/3038912.3052579.
[11] T. Nguyen, M. Rosenberg, X. Song, J. Gao, S. Tiwary, R. Majumder, L. Deng, MS MARCO: A human generated machine reading comprehension dataset, CoRR abs/1611.09268 (2016). URL: http://arxiv.org/abs/1611.09268. arXiv:1611.09268.
[12] M. Schildwächter, A. Bondarenko, J. Zenker, M. Hagen, C. Biemann, A. Panchenko, Answering comparative questions: Better than ten-blue-links?, in: L. Azzopardi, M. Halvey, I. Ruthven, H. Joho, V. Murdock, P. Qvarfordt (Eds.), Proceedings of the 2019 Conference on Human Information Interaction and Retrieval, CHIIR 2019, Glasgow, Scotland, UK, March 10-14, 2019, ACM, 2019, pp. 361–365. URL: https://doi.org/10.1145/3295750.3298916. doi:10.1145/3295750.3298916.
[13] M. Potthast, T. Gollub, M. Wiegmann, B. Stein, TIRA Integrated Research Architecture, in: N. Ferro, C. Peters (Eds.), Information Retrieval Evaluation in a Changing World, The Information Retrieval Series, Springer, Berlin Heidelberg New York, 2019. doi:10.1007/978-3-030-22948-1_5.
[14] J. Johnson, M. Douze, H. Jégou, Billion-scale similarity search with GPUs, IEEE Transactions on Big Data 7 (2019) 535–547.
[15] A. Conneau, D. Kiela, H. Schwenk, L. Barrault, A. Bordes, Supervised learning of universal sentence representations from natural language inference data, CoRR abs/1705.02364 (2017). URL: http://arxiv.org/abs/1705.02364. arXiv:1705.02364.