Team Skeletor at Touché 2021: Argument Retrieval and Visualization for Controversial Questions
Notebook for the Touché Lab on Argument Retrieval at CLEF 2021

Kevin Ros 1,2, Carl Edwards 1,2, Heng Ji 1, and ChengXiang Zhai 1
1 University of Illinois at Urbana-Champaign, 201 N Goodwin Ave, Urbana, Illinois 61801, U.S.A.
2 Equal contribution

Abstract
Arguments are a critical part of education and political discourse in society, especially as more and more information becomes available online. Accessing this information makes argument retrieval a necessary task. In this work, we leverage the existing techniques of BM25 and BERT-based passage embedding similarity, and we introduce a new information retrieval technique based on manifold approximation. Evaluation results on the Touché @ CLEF 2021 topics and relevance scores show that the manifold-based approximation helps discover higher-quality arguments. Furthermore, we use these retrieval methods to visualize argument progression for users watching debates. The visualization results show promising directions for future exploration.

Keywords
information retrieval, argument, manifold approximation, visualization

CLEF'21: Conference and Labs of the Evaluation Forum, September 21-24, 2021, Bucharest, Romania
kjros2@illinois.edu (K. Ros); cne2@illinois.edu (C. Edwards); hengji@illinois.edu (H. Ji); czhai@illinois.edu (C. Zhai)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

Arguments are an important part of education and political discourse in society. As the amount of information and social media use grows on the internet, especially surrounding controversial topics, it is critical to improve access to relevant debates, thereby improving public understanding of divisive topics [1, 2]. Furthermore, traditional search engines are often limited in their ability to effectively display and update relevant information during a live debate, especially when the debate topics are constantly changing. This paper attempts to address these concerns by investigating both argument retrieval and visualization. More specifically, we participate in Touché @ CLEF 2021 [3, 4], which presents two distinct argument retrieval tasks: retrieving arguments for controversial questions and retrieving arguments for comparative questions. We focus on the first task, with the goal of supporting users by retrieving and visualizing relevant arguments and sentences for controversial questions.

This argument retrieval task goes beyond traditional information retrieval because the retrieval methods need to capture both relevance and argument strength. As basic retrieval models have performed well on this task [5], in addition to the standard baseline BM25 and BERT embedding-based retrieval we explore a new approach in which we leverage the properties of manifold approximation, which is commonly used for dimensionality reduction [6], as a pseudo-relevance-feedback reranking approach. The manifold-based reranking approach assumes the highest-ranked initially retrieved arguments are relevant to the controversial question, and computes a directed-edge existence probability from each argument to all other arguments in the corpus. Our hypothesis is that strong, complete, and relevant arguments will have many other arguments "pointing" to them.
That is, these arguments should have many high-probability incoming directed edges. Thus, we rerank the arguments based on the aggregation of their incoming edge probabilities. Furthermore, we build on these retrieval approaches to visualize the topics and trajectory of real-time debates as they progress word by word with respect to a reference corpus.

Experiments using the args.me corpus [7] and the Touché @ CLEF 2021 [3, 4] topics and relevance scores show that our manifold-based ranking formula improves upon BM25 in argument quality. Additionally, our exploration of visualization techniques using the args.me corpus and a spoken debate shows promise in the direction of debate summarization and augmentation.

2. Related Work

Our retrieval methods are inspired by passage-level evidence, as we treat each argument as a collection of sentences [8]. We follow the general methods described by SBERT's retrieval and re-ranking.¹ Zhao et al. [9] use manifold-based text representations of sentences in the biomedical domain to capture the geometric relationships between sentences. Other work also incorporates manifold learning into text representations [10, 11, 12]. To our knowledge, we are the first to incorporate sentence-level manifold representations into information retrieval.

Regarding conversation augmentation, Lyons et al. investigate leveraging dual-purpose speech, which they define as speech socially appropriate to humans and meaningful to computers [13]. Their software plays the role of an assistant (recording dates, scheduling events) rather than introducing additional knowledge to the conversations, which is what we aim to do. Boyd et al. propose to augment conversations with prosody information to help users with autism detect atypical prosody [14]. We attempt to introduce similar metadata to the debates (in the form of conversational topics) as well as introduce additional arguments directly related to the topics being discussed. Popescu-Belis et al. introduce a speech-based just-in-time retrieval system which uses semantic search [15]. That is, they record and transcribe conversations, and provide relevant documents to the participants of the conversation in real time. Their search methods are based on keywords previously spoken during the conversation, obtained via automatic speech recognition (ASR) [16]. A word is considered a keyword if it is in the ASR transcript and is not a stopword, or if it is in a pre-constructed list. Thus, the search queries are limited to what has already been spoken, and high-level dependencies between previously discussed ideas cannot be leveraged. We believe our visualization approaches better address both of these issues.

There has also been much work in the general field of visualizing information retrieval [17], but none of these approaches combine BERT and manifold-based dimensionality reduction to allow for a more fine-grained understanding of arguments over time.

¹ https://www.sbert.net/examples/applications/retrieve_rerank/README.html

3. Argument Retrieval

In the following subsections, we describe our argument retrieval methods and results. Each approach retrieves arguments from the args.me corpus (version 2020-04-01), which consists of 387,740 arguments scraped from various online debate portals [7]. For each argument entry in the corpus, we only consider the text in the "premise" field.
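As a concrete reference point, the snippet below sketches how premise texts can be pulled out of the corpus. It is a minimal sketch rather than our exact preprocessing code: the file name is a placeholder, and the "arguments"/"premises"/"text"/"id" field names reflect our reading of the args.me JSON layout.

```python
import json

# Minimal sketch: extract premise texts from the args.me corpus
# (version 2020-04-01). The file name is a placeholder; the field names
# follow our reading of the corpus JSON layout.
with open("args-me.json", encoding="utf-8") as f:
    corpus = json.load(f)

# One premise text per argument, keyed by argument ID. Arguments with
# several premise entries are concatenated into a single text.
premises = {
    arg["id"]: " ".join(p["text"] for p in arg["premises"])
    for arg in corpus["arguments"]
}

print(len(premises))  # should report 387,740 arguments
```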
Our methods are primarily evaluated using the topics and relevance scores from Touché @ CLEF 2021, and we also include the scores of our methods on last year's iteration of the competition for completeness. The relevance scores from last year consist of either −2 (non-argument) or a value from 1 (low relevance, weak argument) to 5 (high relevance, strong argument). This year's relevance scores use the same range; however, they are given along two separate dimensions: argument relevance and argument quality. There are 50 distinct topics, each consisting of a short "title" field and longer "description" and "narrative" fields. For our queries, we only use the "title" field. Some examples of "title" fields include "Do we need sex education in schools?" and "Should stem cell research be expanded?".

3.1. Methods

3.1.1. BM25

For our baseline approach, we use BM25, a bag-of-words ranking formula that relies on keyword matching between a query and a collection of arguments, along with various weighting heuristics. To process, index, and search arguments, we use Pyserini, a Python-based information retrieval toolkit built over Anserini and Lucene [18]. All argument premises are processed and indexed using the default Pyserini settings, which include stopword removal and stemming. All queries are processed similarly. We use Pyserini's provided BM25 implementation to search the corpus, only adjusting the k1 and b parameters, which we tune on last year's topics and relevance scores.

3.1.2. Semantic Search

Given that BM25 only matches exact terms, we explore the effectiveness of encoder-based k-nearest-neighbor search to help bridge potential vocabulary gaps. To do this, we first split the premises of each argument by sentence into smaller passages of approximately 200 words each. Then, we encode each passage using the msmarco-distilbert-base-v3 encoder model provided by Sentence Transformers [19]. At a high level, msmarco-distilbert-base-v3 is a BERT-based [20] Siamese sentence encoder fine-tuned for question answering on the MS MARCO data set [21]. The passage embeddings are stored and indexed using the hnswlib Python library [22], which provides an approximate nearest-neighbor lookup index based on hierarchical navigable small world graphs. Each topic title is also encoded using msmarco-distilbert-base-v3, and given the encoded topic, we search for the approximate top k nearest-neighbor passages. The top arguments are ordered by the maximum cosine similarity between the topic and any of their passages. All parameters are again tuned using the previous iteration of the task.

We also investigate interpolating the scores returned via semantic search with those returned by BM25, using the following formula:

$$score_{BM25} + \alpha \times score_{semantic}$$

where α controls the contribution of the semantic score.
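The sketch below illustrates how the pieces of Sections 3.1.1 and 3.1.2 fit together. It is a simplified reconstruction rather than our exact code: the index path is a placeholder, the passage splitter cuts on word counts whereas we split by sentence, and the Pyserini import path may differ across library versions.

```python
import hnswlib
from pyserini.search import SimpleSearcher  # import path varies by Pyserini version
from sentence_transformers import SentenceTransformer

ALPHA = 0.7  # interpolation weight used by the bm25-0.7semantic run

# BM25 over a Lucene index of argument premises (placeholder path).
searcher = SimpleSearcher("indexes/argsme-premises")
searcher.set_bm25(3.2, 0.2)  # k1 and b tuned on the 2020 topics

def split_passages(text, max_words=200):
    # Simplified splitter; the paper splits by sentence into ~200-word passages.
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

# `premises` maps argument IDs to premise texts (see the loading sketch above).
passage_texts, passage_arg_ids = [], []
for arg_id, text in premises.items():
    for passage in split_passages(text):
        passage_texts.append(passage)
        passage_arg_ids.append(arg_id)

model = SentenceTransformer("msmarco-distilbert-base-v3")
embeddings = model.encode(passage_texts, normalize_embeddings=True)

index = hnswlib.Index(space="cosine", dim=embeddings.shape[1])
index.init_index(max_elements=len(passage_texts), ef_construction=200, M=16)
index.add_items(embeddings)
index.set_ef(2000)  # query-time ef must be at least as large as k

def retrieve(topic_title, k=1000):
    """Rank arguments by score_BM25 + ALPHA * score_semantic."""
    bm25 = {hit.docid: hit.score for hit in searcher.search(topic_title, k=k)}
    query = model.encode([topic_title], normalize_embeddings=True)
    labels, distances = index.knn_query(query, k=k)
    semantic = {}
    for label, dist in zip(labels[0], distances[0]):
        arg_id = passage_arg_ids[label]
        # Argument score = max cosine similarity over any of its passages.
        semantic[arg_id] = max(semantic.get(arg_id, 0.0), 1.0 - dist)
    return sorted(
        ((aid, bm25.get(aid, 0.0) + ALPHA * semantic.get(aid, 0.0))
         for aid in set(bm25) | set(semantic)),
        key=lambda pair: pair[1], reverse=True)
```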
3.1.3. Manifold Approximation

Our third argument retrieval approach attempts to leverage the techniques utilized in UMAP (Uniform Manifold Approximation and Projection) [6]. UMAP is a dimensionality reduction technique that first approximates a uniform manifold for each data point and patches together their local fuzzy simplicial set representations, where a simplicial set is a higher-dimensional generalization of a directed graph. This topological representation is then used to assess and optimize lower-dimensional representations. A full theoretical description of UMAP is beyond the scope of this paper, so we focus solely on the computational aspects of UMAP's manifold approximation that are relevant to our retrieval approach.

To approximate a uniform manifold for each data point x_i, UMAP first finds the k nearest neighbors of x_i. Then, it defines ρ_i and σ_i, where

$$\rho_i = \min\{\, d(x_i, x_{i_j}) \mid 1 \leq j \leq k,\ d(x_i, x_{i_j}) > 0 \,\}, \tag{1}$$

$$\sum_{j=1}^{k} \exp\!\left(\frac{-\max(0,\ d(x_i, x_{i_j}) - \rho_i)}{\sigma_i}\right) = \log_2(k), \tag{2}$$

and d(x_i, x_{i_j}) is the distance between x_i and its j-th nearest neighbor x_{i_j}. Intuitively, ρ_i is the distance to x_i's closest neighbor (in our case, the most similar passage) and σ_i smooths and normalizes the distances to the nearest neighbors. Next, UMAP calculates the following weights between data points:

$$w((x_i, x_j)) = \exp\!\left(\frac{-\max(0,\ d(x_i, x_{i_j}) - \rho_i)}{\sigma_i}\right). \tag{3}$$

Calculating this for every data point x_i results in a k-granularity weighted adjacency matrix over all points in the data. The authors of UMAP note that w((x_i, x_j)), or entry (i, j) of the weighted adjacency matrix, can be interpreted as the probability that a directed edge from x_i to x_j exists.

For the purposes of argument retrieval, our hypothesis is that strong, complete, and relevant arguments will have many other arguments "pointing" to them; that is, these arguments should have many high-probability incoming directed edges. Thus, for a given topic title, we first search using the aforementioned interpolated BM25 and semantic retrieval methods. We encode all of the passages for the top n arguments. Next, for each encoded passage, we find the k nearest neighbors and calculate (1), (2), and (3) as described above. Finally, we score each argument by the sum of all directed edges pointing to the argument.

Note that the sum of these calculated passage weights possesses different properties than the plain sum of the passage similarities. Most notably, Equation 2 constrains the scaled sum of distances to log2(k), where k is the number of nearest neighbors. Our understanding is that this calculation gives importance to points that have fewer highly-similar (closer) neighbors. For example, suppose we have two points (x) and (y), and the (point, distance) pairs of their three nearest neighbors are

(x): [(a, 0.1), (b, 0.2), (c, 0.9)]
(y): [(d, 0.1), (e, 0.2), (f, 0.3)]

Then the weight between (x) and (b) will be higher than that between (y) and (e), even though the two neighbors are at the same distance. Here are the resulting weights from the manifold calculation (σ_x = 0.179741, σ_y = 0.113319):

(x): [(a, 1), (b, 0.5733), (c, 0.0117)]
(y): [(d, 1), (e, 0.4138), (f, 0.1712)]

Intuitively, this may help reduce the importance of passages that are similar to many other passages, as such a passage will contribute lower weights to the other passages.
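To make the scoring procedure concrete, here is a minimal numpy/scikit-learn sketch of Equations (1)-(3) applied to passage embeddings. It reflects our reading of the computation rather than the UMAP library's internals: the binary search for σ_i, the cosine metric, and the helper names are our own choices for illustration.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smooth_knn_sigma(dists, rho, k, n_iter=64):
    """Binary-search sigma_i so that Equation (2) holds, i.e. the
    smoothed neighbor weights sum to log2(k)."""
    target = np.log2(k)
    lo, hi = 1e-12, 1e4
    for _ in range(n_iter):
        mid = (lo + hi) / 2.0
        total = np.exp(-np.maximum(0.0, dists - rho) / mid).sum()
        if total > target:   # weights too large -> shrink sigma
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2.0

def manifold_scores(embeddings, passage_arg_ids, k=50):
    """Score each argument by the sum of the incoming directed-edge
    probabilities of its passages (Equations (1) and (3))."""
    nn = NearestNeighbors(n_neighbors=k + 1, metric="cosine").fit(embeddings)
    dists, idx = nn.kneighbors(embeddings)
    dists, idx = dists[:, 1:], idx[:, 1:]      # drop each point's self-match
    scores = {}
    for i in range(len(embeddings)):
        positive = dists[i][dists[i] > 0]
        rho = positive.min() if positive.size else 0.0              # Equation (1)
        sigma = smooth_knn_sigma(dists[i], rho, k)                  # Equation (2)
        weights = np.exp(-np.maximum(0.0, dists[i] - rho) / sigma)  # Equation (3)
        for j, w_ij in zip(idx[i], weights):   # directed edge i -> j
            arg_id = passage_arg_ids[j]
            scores[arg_id] = scores.get(arg_id, 0.0) + float(w_ij)
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)
```

In this sketch the reranking is applied to the passage embeddings of the initially retrieved arguments, mirroring how the manifold runs rerank the output of the interpolated BM25 and semantic search.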
3.2. Results

We submitted five runs to Touché 2021; their performance measures are listed in Table 1. The 2021 runs are judged along two dimensions, argument relevance and argument quality, which correspond to the second and third columns of the table, respectively. As a reference, the fourth column gives the performance of our retrieval models on the topics and relevance scores from Touché 2020. All measures are calculated using normalized discounted cumulative gain at five (nDCG@5).

Table 1
Performance on Touché 2021 and 2020 Topics and Relevance Scores

Run Name            Relevance nDCG@5    Quality nDCG@5    2020 nDCG@5
bm25                0.661               0.822             0.6214
semantic            0.570               0.671             0.3475
bm25-0.7semantic    0.667               0.815             0.6347
manifold            0.666               0.827             0.5417
manifold-c10        0.666               0.818             0.5906

The first run, "bm25", corresponds to the approach outlined in Section 3.1.1. We tuned the parameters using grid search on the 2020 topics and relevance scores and arrived at k1 = 3.2 and b = 0.2. The next row, "semantic", corresponds to Section 3.1.2; we set the number of nearest neighbors to k = 1000 for each topic. Next, "bm25-0.7semantic" denotes the interpolation of the two aforementioned approaches, with an α value of 0.7. The final two rows correspond to the approach described in Section 3.1.3. For "manifold", we assume the top 3 arguments from "bm25-0.7semantic" are relevant and search for the k = 50 nearest neighbors of each argument passage. The retrieved passages are completely reranked by aggregating the weights over each argument. For "manifold-c10", we perform the exact same search, but only rerank the top 10 arguments of the "bm25-0.7semantic" run.

For this year's evaluations, our best-performing run with respect to relevance is "bm25-0.7semantic". However, all of our other runs that utilize BM25 (i.e., all except "semantic") perform similarly. With respect to quality, our best-performing run is "manifold". Here, it is promising that "manifold" outperformed "manifold-c10", as this implies that the manifold technique is able to increase argument quality by retrieving arguments outside of the top 10 initially-ranked arguments.

It is unclear whether our initial hypothesis is supported by the scores listed in Table 1. The evaluation metrics from this year seem to support it in the context of our "manifold" run, but last year's results show a decrease in performance. This may be because last year's relevance scores combine many different measures into a single dimension. Furthermore, it is difficult to separate out the effects of BM25 on our manifold-based approaches, since these approaches perform similarly. This, along with the high scores of our "bm25" run, stresses the importance of well-tuned, robust models. Overall, these results are a step in the right direction for our hypothesis, but more analysis is needed to draw firm conclusions.

4. Visualization

While a ranked list of document snippets is often sufficient for ordinary web search, such a list is not necessarily optimal for showing the results of argument retrieval, because a debate commonly covers many topics and the user may want to see this topical structure. Topics may be discussed at length, briefly mentioned, or revisited as the debate unfolds. Traditional search engines, which require explicit user querying, often display relevant documents and arguments in a ranked list, which makes it difficult to effectively capture and visualize these topic changes. For example, it may be too time-consuming for a participant in a debate to constantly search for and read all of the relevant documents. Or, someone may want a high-level summary of the debate at various points. Thus, we explore various visualization techniques to help mitigate these concerns, both by minimizing the need for constant user input and by visualizing the structural topic changes. Visualization of search results has been studied before [17, 23, 24, 25]; however, existing visualization methods do not work well for our use case, so we explore new approaches.

For our visualization exploration, we utilize the args.me corpus to help summarize and augment debates in real time. We demonstrate our visualization methods on the publicly-available debate between Bill Nye and Ken Ham on Evolution vs. Creationism.²
We chose this debate primarily because YouTube provides an accurate transcript of the debate with timestamps, and because of the debate's diverse topic coverage. The YouTube transcript timestamps occur approximately every 3 seconds and contain approximately 1-8 words each; we maintain these groupings for our analysis. The transcript text referenced in the analysis is given in Table 4, and the full text of each referenced argument ID is available on GitHub.³

² https://www.youtube.com/watch?v=z6kgvhG3AkI
³ https://github.com/kevinros/toucheRetrievalVisualization/tree/main/arguments

4.1. Visualization Approach with BM25

For any given timestamp t_i, we define a look-back window of size n and collect all the terms that occurred between t_{i−n} and t_i. Then, we search the args.me corpus using the BM25 retrieval approach outlined in Section 3.1.1, with the collected transcript terms as the query, and we record the ranks of the top k arguments returned. We choose BM25 because it is well known to be robust and efficient. Repeating this over a given interval of timestamps results in a smoothed argument-level summary of the interval.

As an example, consider the debate time interval 110:53-114:04; each timestamp and its corresponding text is listed in Table 4. We define a look-back window of size n = 5 and retrieve the top k = 20 arguments for each timestamp. Then, we count the number of times each argument is ranked in the top 20 across all timestamps, and consider only the five most frequent arguments. Figure 1 displays the ranks of these five arguments at each timestamp, and Table 2 lists a high-level description of each argument. The parameters are manually tuned to demonstrate the benefits and drawbacks of this visualization approach.

Figure 1: Rankings of the five most frequent arguments over the transcript window 110:53 - 114:04.

Table 2
Arguments from Figure 1

Argument ID            Representative Topic(s)
S4fde9bb-Aef3913d8     description of a bicycle incident
S9a8b0a09-A22358c86    understanding scriptures, the gospel, God
S690aacea-A986a10d7    creator of universe, infinite power, God
S23dda237-A69f9884f    having unlimited power, omnipotence of God
S5059e885-Abe1aa26d    the justness of God

Of the five arguments returned, S4fde9bb-Aef3913d8 seems to be topically irrelevant to the transcript text. Interestingly, this argument appears to itself be a transcript, and thus it contains many filler words (such as "uh") that are also present in the debate transcript; it appears to be playing the role of a background language model. The other four arguments seem to be relevant, as they discuss topics and themes present in the transcript at different timestamps. From 111:29 to 111:50, argument S9a8b0a09-A22358c86 is one of the highest-ranked, and it discusses "God", "His kingdom", "scripture", and "His actions". From 112:22 to 112:47, arguments S690aacea-A986a10d7 and S23dda237-A69f9884f are ranked the highest; both discuss the powers of the creator of the universe. From 112:53 to 113:10, argument S5059e885-Abe1aa26d is the highest-ranked, which argues in favor of the justness of God.
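A compact sketch of this look-back procedure is given below, reusing the BM25 searcher from Section 3.1.1. It is illustrative only: the index path is a placeholder and the function names are our own.

```python
from collections import Counter

from pyserini.search import SimpleSearcher  # import path varies by Pyserini version

searcher = SimpleSearcher("indexes/argsme-premises")  # placeholder index path
searcher.set_bm25(3.2, 0.2)

def track_argument_ranks(transcript, n=5, k=20):
    """transcript: list of (timestamp, text) pairs in debate order.

    For each timestamp t_i, query BM25 with all terms spoken in the
    look-back window [t_{i-n}, t_i] and record the ranks of the top-k
    arguments, as in Section 4.1."""
    ranks_over_time = []   # one (timestamp, {arg_id: rank}) entry per step
    frequency = Counter()  # how often each argument enters the top k
    for i, (timestamp, _) in enumerate(transcript):
        window = " ".join(text for _, text in transcript[max(0, i - n):i + 1])
        hits = searcher.search(window, k=k)
        ranks = {hit.docid: rank + 1 for rank, hit in enumerate(hits)}
        frequency.update(ranks.keys())
        ranks_over_time.append((timestamp, ranks))
    return ranks_over_time, frequency.most_common(5)
```

The keyword-tracking variant described next only swaps the look-back window query for a fixed set of user-provided terms and tracks ranks through 100 rather than 20.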
One use case for this visualization technique is to help participants of the debate better analyze and justify their stance. For example, participants can draw on the additional knowledge provided by the retrieved arguments to strengthen their own arguments in real time. On the other hand, it is also possible that rebuttals to participants' arguments will be retrieved, which could increase the overall robustness of the debate by exposing counterpoints.

In order to reduce noise and irrelevant arguments, we also explore the possibility of allowing users to specify the search terms or arguments. More specifically, using pre-defined sets of terms, we search the args.me corpus with BM25 to find the arguments most relevant to the provided terms. Then, we display the frequencies of the returned arguments using the methods outlined above, except that we consider ranks through 100 rather than 20. Consider the same debate time interval and the keyword groups "bible god creationism" and "heavens astronomy stars". Figure 2 displays the frequencies of the five most relevant arguments to each keyword group: the first five argument IDs in the legend correspond to the first keyword group, and the second five to the second keyword group. High-level descriptions of the arguments that appear in Figure 2 are listed in Table 3; the first two arguments are from "bible god creationism" and the last three correspond to "heavens astronomy stars". From Figure 2, we see that arguments relevant to both keyword groups are highly ranked between 112:10 and 112:36, indicating that the keywords in the retrieved arguments strongly match the keywords from the debate transcript in this time interval.

Figure 2: Rankings of the five most relevant arguments to "bible god creationism" and to "heavens astronomy stars" over the transcript window 110:53 - 114:04.

Table 3
Arguments from Figure 2

Argument ID            Representative Topic(s)
S379f0b2-Ab47bd29b     showing the validity of theistic evolution
S56a34f98-A3adb8db7    biblical creationism, unfalsifiable
Scf918055-Af439fe9a    heaven, hell, stars, God
S70cdd68a-A5b15aee9    physics, star formation, modern science
S9ad5951e-A78e904a7    astronomy in the context of the Quran

An important benefit of this visualization technique is that it allows the user to specify topics of interest before, during, or after a debate in order to easily track topic occurrences for further analysis. For example, a user looking for a high-level summary of a debate can examine the ranking frequencies of known arguments in order to pinpoint the most relevant points in the debate. However, because this approach provides a high-level overview by referencing relevant arguments via keywords, it abstracts away from the actual content of the debate and the relevant sentences within arguments. To address this, we explore a more fine-grained visualization approach in the following subsection.

4.2. Visualization Approach with UMAP

The advent of new Transformer-based language models such as BERT [20] has led to impressive improvements on a variety of NLP tasks. We seek to use BERT's semantic representation space to better visualize the dynamics of arguments. To do so, we take advantage of UMAP [6], whose goal is to visualize high-dimensional embeddings in a low-dimensional space while preserving topological and structural properties. Using the same BERT-based encoder discussed in Section 3.1.2, we combine the encodings of the sentences of relevant arguments with the "caterpillar embeddings" of our debate transcript to visualize how the debate evolves over time. This approach allows us to analyze fine-grained topic changes as they unfold in the debate, as well as their relevance to a reference corpus.
4.2.1. Caterpillar Embeddings

Caterpillar embeddings are used to track the course of the debate over time. They consist of a sequence of encoder representations taken from across the debate. A naïve approach is to slide a window of size n over the sequence of words w in the transcript with stride s. However, this has the downside of both adding and removing information (words) at each step. Instead, we split each step into two: a growth step and a contraction step. Given a window from word w_i to w_{i+n} of the transcript for some i, the next window will grow to span w_i to w_{i+n+s}. The subsequent window will be a contraction: it will range from w_{i+s} to w_{i+n+s}. Hence, this "caterpillar embedding" technique moves along the transcript of the debate like a caterpillar inching along. At step t, the start and end of the window, S and E respectively, are calculated as follows:

$$S = w_{s\lfloor t/2 \rfloor}, \qquad E = w_{s\lfloor (t+1)/2 \rfloor + n}$$

4.2.2. Argument Retrieval-Based Semantic Visualizations

In order to better define the topology of the semantic space, we extract the top k most frequent arguments over the transcript interval 110:53 to 127:01 (as described in Section 4.1) from the args.me corpus, split them into sentences, and encode the sentences using the previously-mentioned BERT-based sentence encoder. We combine these argument embeddings with the caterpillar embeddings of the debate transcript and project them into two dimensions using UMAP. This traces a path of the debate as it visits different arguments in the semantic space. We can then use the nearby neighbors of the caterpillar embeddings as relevant arguments to show the user at a given timestamp. The full animation can be found on GitHub.⁴

Regardless of which k value we use, we find that this UMAP projection does not preserve the nearest-neighbor structure of the original space well. We believe this is because of the large differences between the conversational semantic structure of the YouTube debate and the written structure of the corpus debates. To mitigate this, we use a nearest-neighbor search in the original space, and we plot the debate embedding together with its m nearest neighbors. Through empirical exploration, we find that m = 100 and k = 100 yield the clearest results. Additionally, we consider the same window as explored in Section 4.1, namely 110:53 to 114:04 (the transcript of this window is available in Table 4). The resulting path at various timestamps is shown in Figure 3. The path quickly moves to the lower left quadrant, which we find to signify the creation of the universe and heavens, particularly in relation to God. The path briefly moves to the right, when the debate focuses more on the omnipotence and omniscience of God. Finally, the debate moves upward, when the discussion changes to physics, life science, and astronomy. The full video can also be found on GitHub.⁵

Figure 3: Visualization of the evolution-creationism debate through the retrieved argument space. (a) Initial frame, time 111:10; (b) time 111:53; (c) time 113:27; (d) final frame, time 114:04.

In Figure 3, we clearly see groupings of arguments' topics and how they change over time. Interestingly, we can also examine the topic path through the corpus that the YouTube debate took. This could be used to track debate topic progression in a visual manner, and to augment live debates both with information relevant to the current point and with relevant information for future, forecasted points. More work is needed, however, to investigate the effects of parameter selection and the effectiveness in various domains.

⁴ https://github.com/kevinros/toucheRetrievalVisualization/blob/main/animations/full_anim.mp4
⁵ https://github.com/kevinros/toucheRetrievalVisualization/blob/main/animations/100top_mean_anim.mp4
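The following sketch shows how the caterpillar windows and the UMAP projection fit together. The transcript and argument-sentence files are hypothetical placeholders, and the window parameters are examples rather than the exact values behind Figure 3.

```python
import umap  # umap-learn package
from sentence_transformers import SentenceTransformer

def caterpillar_windows(words, n=100, s=10):
    """Alternating growth/contraction windows (Section 4.2.1): at step t,
    the window starts at index s*floor(t/2) and ends at s*floor((t+1)/2) + n."""
    windows, t = [], 0
    while True:
        start = s * (t // 2)
        end = s * ((t + 1) // 2) + n
        if end > len(words):
            break
        windows.append(" ".join(words[start:end]))
        t += 1
    return windows

# Hypothetical input files: the debate transcript and the sentences of the
# top-k most frequent arguments extracted as in Section 4.1.
transcript_words = open("transcript.txt").read().split()
arg_sentences = open("argument_sentences.txt").read().splitlines()

model = SentenceTransformer("msmarco-distilbert-base-v3")
windows = caterpillar_windows(transcript_words, n=100, s=10)

# Encode argument sentences (the topology) and transcript windows (the path),
# then project everything into two dimensions with UMAP.
embeddings = model.encode(arg_sentences + windows)
projection = umap.UMAP(n_components=2, random_state=42).fit_transform(embeddings)
argument_xy = projection[:len(arg_sentences)]  # scatter: the argument space
path_xy = projection[len(arg_sentences):]      # trajectory of the debate
# Plotting successive prefixes of path_xy over the argument scatter yields
# an animation in the style of Figure 3.
```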
5. Conclusion and Future Work

In this work, we apply several techniques to the Touché Argument Retrieval task: BM25, semantic search, and manifold-based reranking. Among them, we find that the manifold-based reranking is sometimes more effective than BM25 at returning high-quality arguments. In the future, we hope to compute the manifold weights for every argument in the data set as a preprocessing step, and to investigate efficient ways of combining these weights with retrieval methods that perform well along the relevance dimension, in order to return the strongest and most relevant arguments. To better display search results to users in argument retrieval, we also introduce various visualization techniques based on BM25 keyword matching and UMAP dimensionality reduction, which show promise in the direction of debate augmentation. Although the benefits of this augmentation are difficult to quantify, we believe it will help improve debate understanding and retention, as well as open up avenues for future work. We also hope to improve the visualization by further testing different parameters, retrieval techniques, and background corpora.

Table 4
Transcript of creationism debate from 110:53 to 114:04

Timestamp  Text
110:53     creationism account for the celestial
110:55     bodies
110:56     planets stars moons moving further and
110:59     further apart
111:00     and what function does that serve in the
111:02     grand design
111:04     well when it comes to uh looking at the
111:06     universe of course we believe that in
111:08     the beginning god created the heavens
111:09     and the earth
111:10     and i believe our uh creationist
111:12     astronomers would say yeah you can
111:13     observe
111:14     the universe expanding uh why god is
111:16     doing that in fact in the bible it even
111:17     says he stretches out
111:19     the heavens and seems to indicate that
111:22     there is
111:22     an expansion of the universe and so
111:26     we would say yeah that you can observe
111:27     that that fits with
111:29     what we call observational science
111:30     exactly why god did it that way
111:32     uh i can't answer that question of
111:34     course uh because you know the bible
111:36     says that uh
111:37     god made uh the heavens for for his
111:39     glory and that's why he made
111:41     uh the stars that we see out there and
111:44     it's uh it's to tell us how great he is
111:46     and how big he is and in fact i think
111:48     that's the the thing about the universe
111:49     the universe is
111:50     so large so big out there one of our
111:53     planetarium programs
111:54     looks at this we go in and show you uh
111:57     how large the universe is
111:59     and i think it shows us how great god
112:02     is uh how big he is that he's an
112:04     all-powerful god he's an infinite god
112:07     uh an infinite all-knowing god who
112:09     created the universe
112:10     to show us his power i mean can you
112:13     imagine that and the thing that's
112:14     remarkable
112:15     in the bible for instance says on the
112:17     fourth day of creation
112:18     and and oh he made the stars also it's
112:21     almost like oh by the way i made the
112:22     stars
112:23     um and just to show us he's an
112:25     all-powerful god he's an infinite god so
112:27     i made the stars and he made them to
112:29     show us how great he is and he is he's
112:31     an
112:32     infinite creator god and the more that
112:34     you understand what that means that god
112:36     is all-powerful infinite you stand back
112:39     in all
112:39     you realize how small we are you realize
112:42     wow that god would consider this planet
112:44     is is so significant that he created
112:47     human beings here
112:48     knowing they would sin and yet stepped
112:50     into history to die for us be raised
112:52     from the dead
112:53     to offer us a free gift to salvation wow
112:56     what a god and that's what i would say
112:58     when i see
112:59     the universe as it is mr nye one minute
113:02     any response
113:03     there's a question that troubles us all
113:05     from the time we are
113:08     absolutely youngest and first able to
113:10     think
113:11     and that is where did we come from where
113:13     did i come from
113:15     and this question is so compelling that
113:18     we've
113:19     invented the science of astronomy we've
113:22     invented life science we've invented
113:24     physics we've discovered these natural
113:26     laws
113:27     so that we can learn more about our
113:29     origin and where we came from
113:31     to you when it says he invented the
113:34     stars also
113:36     that's satisfying you're done oh good
113:39     okay to me when i look at the night sky
113:41     i want to know what's out there
113:43     i'm driven i want to know if what's out
113:46     there is any part of me
113:48     and indeed it is the oh by the way
113:51     i find compelling you are satisfied and
113:55     the big thing i want from you
113:56     mr ham is can you come up with something
113:59     that you can predict
114:00     do you have a creation model that
114:02     predicts something that will happen in
114:04     nature

References

[1] A. Perrin, Social media usage, Pew research center 125 (2015) 52–68.
[2] M. Del Vicario, G. Vivaldo, A. Bessi, F. Zollo, A. Scala, G. Caldarelli, W. Quattrociocchi, Echo chambers: Emotional contagion and group polarization on facebook, Scientific reports 6 (2016) 1–12.
[3] A. Bondarenko, L. Gienapp, M. Fröbe, M. Beloucif, Y. Ajjour, A. Panchenko, C. Biemann, B. Stein, H. Wachsmuth, M. Potthast, M. Hagen, Overview of Touché 2021: Argument Retrieval, in: D. Hiemstra, M.-F. Moens, J. Mothe, R. Perego, M. Potthast, F. Sebastiani (Eds.), Advances in Information Retrieval. 43rd European Conference on IR Research (ECIR 2021), volume 12036 of Lecture Notes in Computer Science, Springer, Berlin Heidelberg New York, 2021, pp. 574–582. URL: https://link.springer.com/chapter/10.1007/978-3-030-72240-1_67. doi:10.1007/978-3-030-72240-1_67.
[4] A. Bondarenko, L. Gienapp, M. Fröbe, M. Beloucif, Y. Ajjour, A. Panchenko, C. Biemann, B. Stein, H. Wachsmuth, M. Potthast, M. Hagen, Overview of Touché 2021: Argument Retrieval, in: G. Faggioli, N. Ferro, A. Joly, M. Maistro, F. Piroi (Eds.), Working Notes Papers of the CLEF 2021 Evaluation Labs, CEUR Workshop Proceedings, 2021.
[5] A. Bondarenko, M. Fröbe, M. Beloucif, L. Gienapp, Y. Ajjour, A. Panchenko, C. Biemann, B. Stein, H. Wachsmuth, M. Potthast, M. Hagen, Overview of Touché 2020: Argument Retrieval, in: L. Cappellato, C. Eickhoff, N. Ferro, A. Névéol (Eds.), Working Notes Papers of the CLEF 2020 Evaluation Labs, volume 2696 of CEUR Workshop Proceedings, 2020. URL: http://ceur-ws.org/Vol-2696/.
[6] L. McInnes, J. Healy, J. Melville, Umap: Uniform manifold approximation and projection for dimension reduction, arXiv preprint arXiv:1802.03426 (2018).
[7] Y. Ajjour, H. Wachsmuth, J. Kiesel, M. Potthast, M. Hagen, B. Stein, Data Acquisition for Argument Search: The args.me corpus, in: C. Benzmüller, H. Stuckenschmidt (Eds.), 42nd German Conference on Artificial Intelligence (KI 2019), Springer, Berlin Heidelberg New York, 2019, pp. 48–59. doi:10.1007/978-3-030-30179-8_4.
[8] J. P. Callan, Passage-level evidence in document retrieval, in: SIGIR '94, Springer, 1994, pp. 302–310.
[9] D. Zhao, J. Wang, H. Lin, Y. Chu, Y. Wang, Y. Zhang, Z. Yang, Sentence representation with manifold learning for biomedical texts, Knowledge-Based Systems 218 (2021) 106869.
[10] T. B. Hashimoto, D. Alvarez-Melis, T. S. Jaakkola, Word embeddings as metric recovery in semantic spaces, Transactions of the Association for Computational Linguistics 4 (2016) 273–286.
[11] S. Hasan, E. Curry, Word re-embedding via manifold dimensionality retention, Association for Computational Linguistics (ACL), 2017.
[12] B. Jiang, Z. Li, H. Chen, A. G. Cohn, Latent topic text representation learning on statistical manifolds, IEEE transactions on neural networks and learning systems 29 (2018) 5643–5654.
[13] K. Lyons, C. Skeels, T. Starner, C. M. Snoeck, B. A. Wong, D. Ashbrook, Augmenting conversations using dual-purpose speech, in: Proceedings of the 17th annual ACM symposium on User Interface Software and Technology, 2004, pp. 237–246.
[14] L. E. Boyd, A. Rangel, H. Tomimbang, A. Conejo-Toledo, K. Patel, M. Tentori, G. R. Hayes, Saywat: Augmenting face-to-face conversations for adults with autism, in: Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, 2016, pp. 4872–4883.
[15] A. Popescu-Belis, M. Yazdani, A. Nanchen, P. N. Garner, A speech-based just-in-time retrieval system using semantic search, Technical Report, Idiap, 2011.
[16] P. N. Garner, J. Dines, T. Hain, A. El Hannani, M. Karafiat, D. Korchagin, M. Lincoln, V. Wan, L. Zhang, Real-time ASR from meetings, Technical Report, Idiap, 2009.
[17] M. Hearst, Search user interfaces, Cambridge university press, 2009.
[18] J. Lin, X. Ma, S.-C. Lin, J.-H. Yang, R. Pradeep, R. Nogueira, Pyserini: An easy-to-use python toolkit to support replicable ir research with sparse and dense representations, arXiv preprint arXiv:2102.10073 (2021).
[19] N. Reimers, I. Gurevych, Sentence-bert: Sentence embeddings using siamese bert-networks, arXiv preprint arXiv:1908.10084 (2019).
[20] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
[21] T. Nguyen, M. Rosenberg, X. Song, J. Gao, S. Tiwary, R. Majumder, L. Deng, Ms marco: A human generated machine reading comprehension dataset, in: CoCo@NIPS, 2016.
[22] Y. A. Malkov, D. A. Yashunin, Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs, IEEE transactions on pattern analysis and machine intelligence 42 (2018) 824–836.
[23] S. Liu, X. Wang, C. Collins, W. Dou, F. Ouyang, M. El-Assady, L. Jiang, D. A. Keim, Bridging text visualization and mining: A task-driven survey, IEEE transactions on visualization and computer graphics 25 (2018) 2482–2504.
[24] A. M. MacEachren, A. Jaiswal, A. C. Robinson, S. Pezanowski, A. Savelyev, P. Mitra, X. Zhang, J. Blanford, Senseplace2: Geotwitter analytics support for situational awareness, in: 2011 IEEE conference on visual analytics science and technology (VAST), IEEE, 2011, pp. 181–190.
[25] J. Peltonen, K. Belorustceva, T. Ruotsalo, Topic-relevance map: Visualization for improving search result comprehension, in: Proceedings of the 22nd international conference on intelligent user interfaces, 2017, pp. 611–622.