Exploring Document Expansion for Argument Retrieval
Notebook for the Touché Lab on Argument Retrieval at CLEF 2021

Alina Mailach, Denise Arnold, Stefan Eysoldt and Simon Kleine
Leipzig University, Augustusplatz 10, 04109 Leipzig, Germany

Abstract
Processing opinion-based information is an increasingly relevant task. Especially for complex and morally ambiguous topics, users search for high-quality information, which makes it necessary to process argumentative information automatically. This notebook documents our attempt to improve argument retrieval using document expansion methods, as a contribution of team Hua Mulan to Touché@CLEF 2021. Before runtime, we expand arguments by predicting queries and hallucinating arguments using Transformer architectures, and by a computationally more efficient approach based on TF-IDF. Compared to ad-hoc retrieval on the original args.me corpus with the Dirichlet Language Model, argument hallucination improved on the baseline when evaluated on argument quality, while no improvement was obtained when argument relevance was evaluated.

Keywords
Argument retrieval, document expansion, query prediction

1. Introduction

The rapid digitalization and development of novel technologies have led to an unprecedented amount of information that has to be processed by individuals and that transforms our society accordingly. This especially affects the ways in which we debate and form opinions, which holds true for simple decision-making as well as for morally ambiguous topics and politics. A great challenge in this respect is the accurate and automatic identification, validation and retrieval of argumentative patterns in order to help users deal with the tremendous amount of information and to ease opinion formation processes. The shared task Touché@CLEF 2021 [1, 2] is the first shared task focusing on argument retrieval.
Task 1 is dedicated to developing methods to identify and score conversational arguments in a search scenario in which the user tries to find good arguments regarding a relevant, ambiguous topic. In this notebook we describe our findings as team Hua Mulan after evaluating document expansion methods, inspired by approaches in regular information retrieval, on the task of retrieving arguments from the args.me corpus [3].

Our work builds upon several contributions to the Touché@CLEF shared Task 1 in 2020 [4]. Closely related but distinct contributions are the query expansion methods using transformers by Akiki & Potthast [5] as well as query expansion using WordNet synonyms [6]. Our approach differs both in the moment of execution and in what is expanded: while both former contributions expand queries at runtime, we expand documents prior to indexing. Of our three expansion methods, the first two are based on the Transformer architecture, while the third is a more intuitive approach based on finding synonyms. In the following sections we further elaborate on our approaches and findings.

CLEF 2021 – Conference and Labs of the Evaluation Forum, September 21–24, 2021, Bucharest, Romania
Email: mailach@informatik.uni-leipzig.de (A. Mailach); arnold@studserv.uni-leipzig.de (D. Arnold); se57nafy@studserv.uni-leipzig.de (S. Eysoldt); pge12kaa@studserv.uni-leipzig.de (S. Kleine)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

Figure 1: Procedure of document expansion using docTTTTTquery

2. Document Expansion

A gap that many retrieval approaches try to bridge is the mismatch between terms in a query and terms in documents relevant to this query. This mismatch is caused by different words describing the same content.
One way to raise the probability of retrieving a document that, in its original form, does not contain the keywords of the query is to enrich documents with terms that are not yet contained but are very likely to appear in a query used to search for these documents. This section describes three such approaches: expanding documents by predicting relevant queries (2.1), by hallucinating arguments (2.2), and by extracting synonyms (2.3).

2.1. Query Prediction

Predicting queries and augmenting documents with these predicted queries was introduced in 2019 by Nogueira et al. [7] under the name doc2query. The approach is grounded in the idea of conceptualizing the retrieval process as a question-answering system, in which the query represents a question for which the user seeks the right answer in order to satisfy her information need. The authors built their query prediction system by training a vanilla sequence-to-sequence model on the MS MARCO dataset [8]. In 2020, Nogueira et al. published an improved version called docTTTTTquery [9], whose main difference from doc2query is that it is based on a T5 (Text-to-Text Transfer Transformer) [10] encoder-decoder architecture. It was also trained on approximately 500,000 passage-query pairs and is publicly available in the authors' GitHub repository¹.

For the expansion of the arguments we used the original pretrained model and predicted ten queries per argument. Figure 1 shows the general procedure of the prediction. Premises and conclusion were concatenated to serve as the input and tokenized with Huggingface's tokenizer². The input tensors were truncated to 512 tokens and subsequently fed to the model. After detokenization, the predicted queries were appended to the original premises. As in the original work [9], we did not mark the expansion with any special characters. Table 1 gives an example argument with its predicted queries. We expected the additional information added by the predicted queries to reduce the issue of term mismatch and therefore to improve retrieval performance.

Table 1: Queries predicted by docTTTTTquery

Premises:
    Teachers who perform below benchmarks such as retention, attendance, academic performance results, assessing required learning outcomes and student feedback, should not be allowed tenure because students suffer to be successful and colleges suffer in graduation rates.
Conclusion:
    Colleges should abolish the ability for teachers to be tenured.
Predicted queries:
    • Why should teachers not be tenured?
    • Why should tenured teachers be banned?
    • Why should tenured teachers not be allowed to work at a college?
    • Why should tenure be abolished?

2.2. Argument Hallucination

Akiki & Potthast [5] explored query expansion scenarios using different Transformer methods. Running multiple text sequences generated by Generative Pretrained Transformer 2 (GPT-2)³ against the index improved retrieval quality compared to a simple baseline. We adapted the idea of generating arguments so that it fits document expansion. While Akiki & Potthast generated 24 different sequences for each query, we generated two sequences for each of the 72,173 unique conclusions in the args.me corpus. This should give premises belonging to relevant conclusions a boost in the retrieval process by adding words and thus diversifying the language model for those arguments. Like Akiki & Potthast, we wrapped each conclusion in an interview-like scenario, using hyphens to indicate the conversational nature of the text. We further augmented the conclusion with a positive or negative prompt, leading GPT-2 in either of those directions. Since the retrieval process and the metric are agnostic towards the stance of an argument, we expanded each argument with both the positively and the negatively generated sequences. An example generation is shown in Table 2. Note that we decided not to generate sequences with neutral prompts, since our experiments revealed no relevant information gain compared to the positive and negative versions while increasing runtime by a third.

Table 2: Example arguments hallucinated by GPT-2

Prompt (shared by both versions):
    -What do you think of: Colleges should abolish the ability for teachers to be tenured?-
Negative prompt:
    The answer is no, because the current system of teaching in America has been a disaster since it began and continues today (see my post on this topic). The only way we can get rid from that situation would be by abolishing teacher tenure at all levels and I’m not talking just one level here; there are many more who have already done so as well! But if they don’t want their children taught like other kids then why bother with them when your child will learn something new every day?! [...]
Positive prompt:
    The answer is yes, because the College of Education has been abolished. The only way that it can continue as a college in this country and still exist today would be if we had an independent school system where all students were treated equally regardless what their academic abilities are or how they perform at work (and not just on campus). This means there wouldn’t even have any problem with having one teacher who was able/unable get tenure from his job [...]

¹ https://github.com/castorini/docTTTTTquery
² https://huggingface.co/transformers/model_doc/t5.html
³ https://huggingface.co/gpt2

2.3. Synonym Extraction

The two methods described above use complex neural networks and therefore rely heavily on computational capacity and hardware acceleration.
In real-world retrieval scenarios, expanding all arguments prior to or at indexing time would lead to computational and runtime issues. Finding a more basic strategy for argument expansion is thus both necessary and interesting, and we implemented an approach similar to Bundesmann et al. [6]. For each argument we extracted the main keywords using term frequency-inverse document frequency (TF-IDF), which indicates that a term occurs relatively often in a document compared to its occurrence in the rest of the args.me corpus and is therefore more relevant than other words. We used the scikit-learn library⁴ to compute TF-IDF and then augmented the argument with synonyms. For each argument, the top 10 keywords that appeared in at most 20% of the documents in the corpus were extracted. Subsequently, we searched for synonyms in the WordNet database [11] and appended them to the original premises. On average we extracted 8.80 keywords per argument (standard deviation 2.55) and added 25.30 synonyms (standard deviation 11.70).

⁴ www.scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html

Table 3: Results

                          Relevance                      Quality
                          nDCG@5 (mean)  CI 95%          nDCG@5 (mean)  CI 95%
Baseline (Dirichlet)      0.626          [0.550, 0.691]  0.796          [0.755, 0.838]
Query Prediction          0.518          [0.446, 0.588]  0.654          [0.584, 0.724]
Argument Hallucination    0.620          [0.545, 0.685]  0.811          [0.770, 0.849]
Synonym Extraction        0.620          [0.549, 0.685]  0.789          [0.750, 0.830]

3. Evaluation

For the evaluation, each of the augmented corpora was indexed using Elasticsearch's built-in similarity based on the Dirichlet Language Model (DirichletLM) to obtain the thousand best-matching arguments. DirichletLM was chosen mainly for performance reasons, as it proved to be the most adequate model for ad-hoc argument retrieval [12] and methods based on DirichletLM outperformed approaches based on other retrieval methods [4].
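To illustrate how a Dirichlet-smoothed language model ranks documents, and why appending terms to a document can change its score, the following is a minimal sketch of the textbook query-likelihood formulation. It is not the code used in our experiments: the function name `dirichlet_score`, the toy documents, the whitespace tokenization and the smoothing value `mu=2000.0` are all assumptions made for illustration, and Elasticsearch's built-in implementation may differ in detail (for example in how scores are normalized).

```python
import math
from collections import Counter


def dirichlet_score(query, doc, collection_tf, collection_len, mu=2000.0):
    """Log query likelihood of `doc` under a Dirichlet-smoothed language model.

    p(w | d) = (tf(w, d) + mu * p(w | C)) / (|d| + mu)
    """
    doc_tf = Counter(doc)
    doc_len = len(doc)
    score = 0.0
    for term in query:
        # Background probability of the term in the whole collection.
        p_collection = collection_tf.get(term, 0) / collection_len
        if p_collection == 0.0:
            continue  # simplification: terms unseen in the collection are skipped
        p_term = (doc_tf[term] + mu * p_collection) / (doc_len + mu)
        score += math.log(p_term)
    return score


# Hypothetical toy data, tokenized on whitespace.
original = "teachers below benchmarks should not be allowed tenure".split()
other = "school uniforms limit the freedom of students".split()
# Document expansion appends additional terms (here: a predicted query).
expanded = original + "why should tenure be abolished".split()

# Collection statistics are taken from the expanded index.
collection = [expanded, other]
collection_tf = Counter(term for doc in collection for term in doc)
collection_len = sum(len(doc) for doc in collection)

query = "should tenure be abolished".split()
for name, doc in [("original", original), ("expanded", expanded), ("other", other)]:
    print(name, dirichlet_score(query, doc, collection_tf, collection_len))
```

In this toy setting the expanded document scores higher than the original one because the appended terms raise the term frequencies of the query words, even though the document also becomes longer. The same mechanism applies to every expanded document, including less relevant ones.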
We ran non-systematic pre-tests on the corpus without augmentation and found that the DirichletLM smoothing parameter μ = 2148 retrieved good results. The approaches were evaluated on the TIRA platform [13] for comparability and reproducibility.

Table 3 shows mean results for relevance and quality of the retrieved arguments. In terms of argument relevance, none of our approaches was able to improve on baseline ad-hoc retrieval, while argument hallucination using GPT-2 achieved slightly higher results in argument quality. One reason for these results could be that all documents were expanded, which also boosts less relevant documents. Further research could explore expanding only high-quality arguments in order to selectively improve the retrieval of these documents.

4. Conclusion

We investigated the effect of different document expansion methods on argument retrieval. The examined methods tackled the issue of term mismatch using three different approaches. We used docTTTTTquery to predict relevant queries and GPT-2 to hallucinate arguments. To test another approach to the mismatch problem, we extracted keywords and searched for their synonyms in the WordNet database. The augmented corpora were indexed and retrieved using DirichletLM. None of the introduced approaches was able to beat the simple baseline of ad-hoc retrieval using DirichletLM in terms of argument relevance. When evaluating argument quality, expanding documents with hallucinated arguments slightly improved retrieval.

References

[1] A. Bondarenko, M. Fröbe, M. Beloucif, L. Gienapp, Y. Ajjour, A. Panchenko, C. Biemann, B. Stein, H. Wachsmuth, M. Potthast, M. Hagen, Overview of Touché 2020: Argument Retrieval, in: L. Cappellato, C. Eickhoff, N. Ferro, A. Névéol (Eds.), Working Notes Papers of the CLEF 2020 Evaluation Labs, volume 2696 of CEUR Workshop Proceedings, 2020. URL: http://ceur-ws.org/Vol-2696/.
[2] A. Bondarenko, L. Gienapp, M. Fröbe, M. Beloucif, Y. Ajjour, A. Panchenko, C. Biemann, B. Stein, H. Wachsmuth, M. Potthast, M. Hagen, Overview of Touché 2021: Argument Retrieval, in: G. Faggioli, N. Ferro, A. Joly, M. Maistro, F. Piroi (Eds.), Working Notes Papers of the CLEF 2021 Evaluation Labs, CEUR Workshop Proceedings, 2021.
[3] Y. Ajjour, H. Wachsmuth, J. Kiesel, M. Potthast, M. Hagen, B. Stein, Data Acquisition for Argument Search: The args.me corpus, in: C. Benzmüller, H. Stuckenschmidt (Eds.), 42nd German Conference on Artificial Intelligence (KI 2019), Springer, Berlin Heidelberg New York, 2019, pp. 48–59. doi:10.1007/978-3-030-30179-8_4.
[4] A. Bondarenko, M. Fröbe, M. Beloucif, L. Gienapp, Y. Ajjour, A. Panchenko, C. Biemann, B. Stein, H. Wachsmuth, M. Potthast, M. Hagen, Overview of Touché 2020: Argument Retrieval, in: L. Cappellato, C. Eickhoff, N. Ferro, A. Névéol (Eds.), Working Notes Papers of the CLEF 2020 Evaluation Labs, volume 2696 of CEUR Workshop Proceedings, 2020. URL: http://ceur-ws.org/Vol-2696/.
[5] C. Akiki, M. Potthast, Exploring Argument Retrieval with Transformers, in: L. Cappellato, C. Eickhoff, N. Ferro, A. Névéol (Eds.), Working Notes Papers of the CLEF 2020 Evaluation Labs, volume 2696 of CEUR Workshop Proceedings, 2020. URL: http://ceur-ws.org/Vol-2696/.
[6] M. Bundesmann, L. Christ, M. Richter, Creating an argument search engine for online debates, in: L. Cappellato, C. Eickhoff, N. Ferro, A. Névéol (Eds.), Working Notes Papers of the CLEF 2020 Evaluation Labs, volume 2696 of CEUR Workshop Proceedings, 2020. URL: http://ceur-ws.org/Vol-2696/.
[7] R. Nogueira, W. Yang, J. Lin, K. Cho, Document expansion by query prediction, CoRR abs/1904.08375 (2019). URL: http://arxiv.org/abs/1904.08375. arXiv:1904.08375.
[8] T. Nguyen, M. Rosenberg, X. Song, J. Gao, S. Tiwary, R. Majumder, L. Deng, MS MARCO: A human generated machine reading comprehension dataset, in: T. R. Besold, A. Bordes, A. S. d’Avila Garcez, G. Wayne (Eds.), Proceedings of the Workshop on Cognitive Computation: Integrating neural and symbolic approaches 2016 co-located with the 30th Annual Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain, December 9, 2016, volume 1773 of CEUR Workshop Proceedings, CEUR-WS.org, 2016. URL: http://ceur-ws.org/Vol-1773/CoCoNIPS_2016_paper9.pdf.
[9] R. Nogueira, J. Lin, From doc2query to docTTTTTquery, Online preprint (2019).
[10] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the limits of transfer learning with a unified text-to-text transformer, Journal of Machine Learning Research 21 (2020) 1–67. URL: http://jmlr.org/papers/v21/20-074.html.
[11] G. A. Miller, WordNet: A lexical database for English, Commun. ACM 38 (1995) 39–41. URL: https://doi.org/10.1145/219717.219748. doi:10.1145/219717.219748.
[12] M. Potthast, L. Gienapp, F. Euchner, N. Heilenkötter, N. Weidmann, H. Wachsmuth, B. Stein, M. Hagen, Argument Search: Assessing Argument Relevance, in: 42nd International ACM Conference on Research and Development in Information Retrieval (SIGIR 2019), ACM, 2019. URL: http://doi.acm.org/10.1145/3331184.3331327.
[13] M. Potthast, T. Gollub, M. Wiegmann, B. Stein, TIRA Integrated Research Architecture, in: N. Ferro, C. Peters (Eds.), Information Retrieval Evaluation in a Changing World, The Information Retrieval Series, Springer, Berlin Heidelberg New York, 2019.