<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Overview of Touché 2023: Argument and Causal Retrieval</article-title>
        <subtitle>Extended Version</subtitle>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alexander Bondarenko</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maik Fröbe</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Johannes Kiesel</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ferdinand Schlatt</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Valentin Barriere</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Brian Ravenet</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Léo Hemamou</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Simon Luck</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jan Heinrich Reimer</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Benno Stein</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Martin Potthast</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Matthias Hagen</string-name>
        </contrib>
        <aff>
          <institution>Sanofi R&amp;D France</institution>
        </aff>
        <aff id="aff0">
          <label>0</label>
          <institution>Alma Mater Studiorum - Università di Bologna</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Leipzig University and ScaDS.AI</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper is a report on the fourth edition of the Touché lab on argument and causal retrieval hosted at CLEF 2023. With the goal of creating a collaborative platform for research on computational argumentation and causality, we organized four shared tasks: (a) argument retrieval for controversial topics (retrieve web documents that contain high-quality argumentation and detect the documents' stances), (b) causal retrieval (retrieve web documents that contain causal statements and detect the documents' causal stances), (c) image retrieval for arguments (retrieve images that support a pro or con stance towards some controversial topic), and (d) multilingual multi-target stance classification (detect the stance of comments on proposals from an online multilingual participatory democracy platform).</p>
      </abstract>
      <kwd-group>
        <kwd>Argument retrieval</kwd>
        <kwd>Causal retrieval</kwd>
        <kwd>Image retrieval</kwd>
        <kwd>Stance classification</kwd>
        <kwd>Argument quality</kwd>
        <kwd>Causality</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Making informed decisions and forming opinions on a matter often involves not only weighing
pro and con arguments but also considering cause–effect relationships [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. To make decisions or
to get an overview of different standpoints on some topic, a lot of facts, opinions, arguments, etc.
can be found on the Web. However, conventional web search engines are primarily optimized
for returning relevant results that match a query but not for argument or causal analyses (e.g.,
argument quality or stance). To close this gap, with the Touché1 lab, we offer a platform to
develop and test respective approaches.*
      </p>
      <p>*This overview extends the one published as part of the CLEF 2023 proceedings [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. CLEF 2023: Conference and Labs of the Evaluation Forum, September 18–21, 2023, Thessaloniki, Greece.</p>
      <p>In 2023, we organized the following four shared tasks:
1. Retrieval of documents that contain arguments and opinions on some controversial topic.
2. Retrieval of documents that contain evidence on whether a causal relationship between
two events exists.
3. Retrieval of images to visually corroborate textual arguments and to provide a quick
overview of public opinions on controversial topics.
4. Stance classification of comments on proposals from the multilingual participatory
democracy platform CoFE2 to support opinion formation on socially important topics.</p>
      <p>
        The three retrieval tasks followed the traditional TREC3 methodology: document collections
and topics were provided to the participants, who submitted their results (up to five runs) for
each topic to be judged by human assessors. In the retrieval tasks, all teams used BM25 or
BM25F [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ] for first-stage retrieval. The final ranked lists (runs) were often created (1) based
on argument quality estimation and predicted stance (Task 1), (2) based on the presence of
causal relationships in documents (Task 2), and (3) exploiting the contextual similarity between
images and queries and using the predicted stance for images (Task 3). The participants trained
feature-based and neural classifiers to predict argument quality or stance, and many also used
ChatGPT with various prompt-engineering methods. To predict the stance of multilingual texts
in Task 4, the participants used transformer-based models exploiting a few-step fine-tuning,
data augmentation, and label propagation techniques.
      </p>
      <p>
        The corpora, topics, and judgments created at Touché are freely available to the research
community and can be found on the lab’s website.4 Parts of the data are also already available
via the BEIR [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and ir_datasets [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] resources.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Lab Overview and Statistics</title>
      <p>
        We used TIRA [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] as the submission platform for Touché 2023 through which the participants
could either submit software or upload run files.5 We particularly encouraged software
submissions, as they increase reproducibility and also allow for later running the software on different
data with the same format (e.g., on topics and collections from a previous year). To submit
software, a team had to deploy their approach in a Docker image that they then uploaded to
their dedicated Docker registry in TIRA. Software submissions in TIRA are immutable, and after
a Docker image has been submitted, a team could specify a to-be-executed command—thus,
the same Docker image could be used for multiple software submissions (e.g., by changing
some parameters). A team could upload as many Docker images as needed (the images were
not public while the shared tasks were ongoing). To improve reproducibility, TIRA executes
software in a sandbox by blocking the internet connection. This ensures that the software
is fully installed in the Docker image, which simplifies running the software later. For the
execution, the participants could select the resources from four options: (1) 1 CPU core with
10 GB RAM, (2) 2 cores with 20 GB RAM, (3) 4 cores with 40 GB RAM, or (4) 1 CPU core with
10 GB RAM and 1 Nvidia GeForce GTX 1080 GPU with 7 GB RAM. A software submission could be run
multiple times using different resources to investigate scalability and reproducibility (e.g.,
whether the software executed on a GPU yields the same results as on a CPU). TIRA used a
Kubernetes cluster with 1,620 CPU cores, 25.4 TB RAM, and 24 GeForce GTX 1080 GPUs to
schedule and execute the software submissions, allocating the resources that the participants
selected for their submissions.
1 ‘Touché’ is commonly “used to acknowledge a hit in fencing or the success or appropriateness of an argument, an
accusation, or a witty point.” [https://merriam-webster.com/dictionary/touche]
2 https://futureu.europa.eu
3 https://trec.nist.gov/
4 https://touche.webis.de/
5 https://tira.io
      </p>
      <p>Overall, for the fourth edition of the Touché lab, we received 41 registrations from 21 countries
(vs. 58 registrations in 2022). But from the 41 registered teams, only 7 teams actively participated
by submitting valid results (1 team in Task 1, 1 in Task 2, 3 in Task 3, and 2 in Task 4)—5 of
the 7 teams submitted software. Note that the number of active teams substantially decreased
compared to the previous editions of Touché (23 active teams in 2022, 27 in 2021, and 17 in 2020).
We thus decided to pause the argument and causal retrieval tasks for now.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Task 1: Argument Retrieval for Controversial Questions</title>
      <p>
        The goal of the first task was to support individuals who search for opinions and arguments on
socially important, controversial topics like “Are social networking sites good for our society?”.
The previous task iterations explored different granularities of argument retrieval and analysis: a
focused crawl of debates on various controversial topics from several online debating portals and
the arguments’ concise gist [
        <xref ref-type="bibr" rid="ref10 ref8 ref9">8, 9, 10</xref>
        ]. For the fourth edition of the task, our focus shifted towards
retrieving argumentative web documents from the web crawl corpus ClueWeb22-B [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. The
topics and manual judgments from the previous task iterations were provided to the participants
to enable approaches that leverage training and parameter tuning.
      </p>
      <sec id="sec-3-1">
        <title>3.1. Task Definition</title>
        <p>Given a controversial topic and a collection of web documents, the task was to retrieve and
rank documents by relevance to the topic, ideally also ranking higher documents that contain
high-quality arguments, and to (optionally) detect the document’s stance. Participants of Task 1
needed to retrieve documents from the ClueWeb22-B crawl for 50 search topics.</p>
        <p>
          To lower the entry barrier for participants who could not index the whole ClueWeb22-B
corpus on their side, we provided a first-stage retrieval possibility via the API of the
BM25F-based search engine ChatNoir [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] and a smaller version of the corpus that contained one million
documents per topic. To identify arguments (claims and premises) in documents, participants
could use any existing argument tagging tool such as the TARGER API [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] hosted on our
servers or develop their own tools if necessary.
        </p>
        <p>The example topic in Table 1 describes the following scenario: Democracy may be in the process
of being disrupted by social media, with the potential creation of individual filter bubbles. So a
user wonders if social networking sites should be allowed, regulated, or even banned.</p>
        <p>Highly relevant arguments discuss social networking in general or
particular networking sites, and its/their positive or negative effects
on society. Relevant arguments discuss how social networking
affects people, without explicit reference to society.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Data Description</title>
        <p>Topics. For the task on argument retrieval for controversial questions (Task 1), we provided
50 search topics representing various debated societal issues. These issues were chosen from the
online debate portals (debatewise.org, idebate.org, debatepedia.org, and debate.org), with the
largest number of user-generated comments and thus representing the highest societal interest.
For each such case, we formulated a topic’s title (i.e., a question on a controversial issue), a
description specifying the particular search scenario, and a narrative that served as a guideline
for the human assessors (see Table 1 for an example).</p>
        <p>
          Document Collection. The retrieval collection was the ClueWeb22 (Category B) corpus [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]
that contains 200 million of the most frequently visited multilingual web pages, such as Wikipedia articles,
news websites, etc. The indexed corpus was available via the ChatNoir API6 and its Python
module7 integrated in PyTerrier [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ].
        </p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Evaluation Setup</title>
        <p>
          Our human assessors labeled the ranked lists of documents submitted by the task participants
both for their general topical relevance and for the rhetorical argument quality [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ], i.e.,
“well-writtenness”: (1) whether the document contains arguments and whether the argument text has
a good style of speech, (2) whether the argument text has a proper sentence structure and is
easy to follow, and (3) whether it includes profanity, has typos, etc. Also, the documents’ stance
towards the search topics was labeled as ‘pro’, ‘con’, ‘neutral’, or ‘no stance’.
        </p>
        <p>
          Analogously to the previous Touché editions, our volunteer assessors annotated the
document’s topical relevance with three labels: 0 (not relevant), 1 (relevant), and 2 (highly relevant).
The argument quality was also labeled with three classes: 0 (low quality or no arguments in the
document), 1 (average quality), and 2 (high quality). We provided the annotators with detailed
annotation guidelines, including examples. In the training phase, we asked 4 annotators to
label the same 20 randomly selected documents (initial Fleiss’ kappa values: relevance κ = 0.39
(fair agreement), argument quality κ = 0.34 (fair agreement), and κ = 0.51 (moderate agreement)
for labeling the stance) and in the follow-up discussion clarified potential misinterpretations.
Afterward, each annotator independently judged the results for disjoint subsets of the topics
(i.e., each topic was judged by one annotator only). We used this annotation policy due to the
high annotation workload. Our human assessors labeled in total 747 documents pooled from
8 runs using a top-10 pooling strategy implemented in the TrecTools library [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ].
6 https://github.com/chatnoir-eu/chatnoir-api
7 https://github.com/chatnoir-eu/chatnoir-pyterrier
        </p>
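<p>For illustration, the chance-corrected agreement statistic used above can be computed as in the following sketch; this is a minimal re-implementation of Fleiss’ kappa for fully crossed label matrices, not the tooling actually used by the organizers:</p>

```python
from collections import Counter

def fleiss_kappa(label_matrix):
    """Fleiss' kappa for a list of items, each rated once by every annotator.
    label_matrix: list of items; each item is the list of labels it received."""
    n_items = len(label_matrix)
    n_raters = len(label_matrix[0])
    per_item_agreement = []
    category_totals = Counter()
    for item in label_matrix:
        counts = Counter(item)
        category_totals.update(counts)
        # Proportion of agreeing annotator pairs for this item.
        agree_pairs = sum(c * (c - 1) for c in counts.values())
        per_item_agreement.append(agree_pairs / (n_raters * (n_raters - 1)))
    p_bar = sum(per_item_agreement) / n_items
    # Expected agreement by chance, from the overall label distribution.
    total_labels = n_items * n_raters
    p_e = sum((n / total_labels) ** 2 for n in category_totals.values())
    return (p_bar - p_e) / (1 - p_e)
```

For example, 4 annotators agreeing on every document yield κ = 1.0, while systematic 2-vs-2 splits drive κ below zero.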
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Submitted Approaches and Evaluation Results</title>
        <p>In 2023, only one team participated in Task 1 and submitted seven runs. We thus decided to
evaluate all the participant’s runs and an additional baseline. Below, we summarize and describe
the submitted approaches to the task and evaluation results.</p>
        <p>
          The task’s baseline run by Puss in Boots used the results that ChatNoir [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] returned for the
topics’ titles used as queries without any pre-processing. ChatNoir is an Elasticsearch-based
search engine for the ClueWeb and Common Crawl web corpora that employs BM25F ranking
(fields: document title, keywords, main content, and the whole document) and SpamRank
scores [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]. The document stance for the baseline run was predicted by zero-shot prompting
the Flan-T5 model [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ]8 after summarizing the document’s main content with BART [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ].9 The
summarization step was necessary to meet the Flan-T5 input limit of 512 tokens.
        </p>
        <p>
          Team Renji Abarai [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ] submitted seven runs in total. Their baseline run used the
top-10 results returned by ChatNoir for the pre-processed topics’ titles used as queries. During
pre-processing, stop words were first removed using their own handcrafted list of terms; the
remaining query terms were then lowercased and lemmatized with the Stanza NLP library [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ].
For the other six runs, the results of the baseline run were re-ranked based on the predicted
argument quality and predicted document stance. Argument quality was predicted using either
a meta-classifier (random forests) trained on the class predictions and class probabilities of six
base classifiers or by prompting ChatGPT. Each base classifier (feedforward neural network,
LightGBM [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ], logistic regression, naïve Bayes, SVM, and random forests) was trained in two
variants: (1) using a set of 32 handcrafted features (e.g., sentiment, spelling errors, the ratio of
arguments in documents, etc.) and (2) using documents represented with the instruction-based
fine-tuned embedding model INSTRUCTOR [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ]. All the classifiers were trained on the manual
argument quality labels from the Touché 2021 Task 1 [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], which were also used to select examples
for few-shot prompting ChatGPT. The resulting ranked lists submitted by Renji Abarai differed
in the type of argument quality classifiers used for re-ranking, whether predicted classes or
probabilities were used, or if the predicted document stance was considered. The document
stance for all the runs was predicted using ChatGPT.
        </p>
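<p>The query pre-processing step can be approximated as follows; the stop word list and the lemmatizer stub below are illustrative placeholders, not the team’s actual resources (they used their own handcrafted list and the Stanza lemmatizer):</p>

```python
# Illustrative stop word list; the team used their own handcrafted list of terms.
STOP_WORDS = {"should", "be", "the", "a", "an", "is", "are", "for", "of", "to"}

def preprocess_query(title, lemmatize=lambda token: token):
    """Remove stop words, lowercase, and lemmatize the remaining query terms.
    `lemmatize` stands in for a call to a lemmatizer such as Stanza's."""
    terms = [t.strip("?.,!") for t in title.split()]
    kept = [t.lower() for t in terms if t.lower() not in STOP_WORDS]
    return [lemmatize(t) for t in kept]
```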
        <p>Table 2 shows the results for all evaluated runs with respect to relevance, argument quality,
and stance detection (more detailed results for each submitted run, including the 95% confidence
intervals, are in Tables 10 and 11 in Appendix B). Overall, none of the submitted participant
results outperformed the argumentation-agnostic BM25F-based task baseline. This is due to
the worse effectiveness of the team’s initial retrieval results (‘team_baseline’ run in Table 2)
that were used in the re-ranking step. Five of the participant’s re-ranking strategies were able to
improve over their initial ranking. The most effective participant approach (the ‘stance_ChatGPT’
run in Table 2) exploited ChatGPT to predict the argument quality and stance. Then, a two-step
re-ranking strategy was used: (1) move the ‘no stance’ documents to the bottom of the ranked
list, and then (2) re-rank the remaining documents based on the predicted argument quality
labels in descending order. Thus, a promising future direction can be to apply the proposed
re-ranking approach to the official task baseline run.
8 Pre-trained model: https://huggingface.co/google/flan-t5-base; maximum generated tokens: 3; the prompt is given
in Appendix A.
9 Pre-trained model: https://huggingface.co/facebook/bart-large-cnn; minimum length: 64; maximum length: 256.</p>
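<p>The two-step re-ranking of the most effective run can be sketched as follows, assuming per-document argument quality labels and stance predictions have already been obtained (e.g., via ChatGPT prompting):</p>

```python
def rerank(docs):
    """docs: list of dicts with 'id', 'rank' (initial retrieval rank),
    'quality' (predicted label 0-2), and 'stance'.
    Step 1: push 'no stance' documents to the bottom of the list.
    Step 2: sort the remaining documents by predicted argument quality,
    descending; ties keep the initial retrieval order."""
    with_stance = [d for d in docs if d["stance"] != "no stance"]
    no_stance = [d for d in docs if d["stance"] == "no stance"]
    with_stance.sort(key=lambda d: (-d["quality"], d["rank"]))
    return with_stance + no_stance
```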
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Task 2: Evidence Retrieval for Causal Questions</title>
      <p>The goal of the Touché 2023 lab’s second task was to support users who search for answers
to causal yes-no questions like “Do microwave ovens cause cancer?”, supported by relevant
evidence instances. In general, such causal questions ask if something causes or does not cause
something else.</p>
      <sec id="sec-4-1">
        <title>4.1. Task Definition</title>
        <p>
          Given a causality-related topic and a collection of web documents, the task was to retrieve and
rank documents by relevance to the topic. For 50 search topics, participants of Task 2 needed
to retrieve documents from the ClueWeb22-B crawl that contain relevant causal evidence. An
optional task was to detect the document’s causal stance. A document can provide supportive
evidence (a causal relationship between the cause and effect from the topic holds), refutative
evidence (the causal relationship does not hold), or neutral evidence (it holds in some cases and not in others). Like
in Task 1, ChatNoir [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] could be used for first-stage retrieval.
        </p>
        <p>The example topic in Table 3 describes the following scenario: A user has recently learned
that radiation waves can cause cancer. They are wondering if their microwave oven produces
radiation waves and if these are dangerous enough to cause cancer.</p>
        <p>Highly relevant documents will provide information on a
potential causal connection between microwave ovens and cancer. This
includes documents stating or giving evidence that the first is (or
is not) a cause of the other. Documents stating that there is not
enough evidence to decide either way are also highly relevant.
Relevant documents may contain implicit information on whether the
causal relationship exists or does not exist. Documents are not
relevant if they either mention one or both concepts, but do not
provide any information about their causal relation.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Data Description</title>
        <p>
          Topics. The 50 search topics for Task 2 described scenarios where users search for confirmation
of whether some causal relationship holds. For example, a user may want to know the possible
reason for a current physical condition. Each of these topics had a title (i.e., a causal question),
cause and effect entities, a description specifying the particular search scenario, and a narrative
serving as a guideline for the assessors (see Table 3). The topics were manually selected from a
corpus of causal questions [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ] and a graph of causal statements [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ] such that they spanned a
diverse set of domains.
        </p>
        <p>Document Collection. The same document collection as in Task 1 was used.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Evaluation Setup</title>
        <p>Relevance assessments were gathered with volunteer human assessors. The assessors were
instructed to label documents as not relevant (0), relevant (1), or highly relevant (2). The direction
of causality was considered, i.e., a document stating that B causes A was considered off-topic
(not relevant) for the topic “Does A cause B?”. The document’s stance was also labeled to
evaluate the optional stance detection task. The labeling procedure was analogous to Task 1,
where volunteer assessors participated in training and a discussion. Agreement on the same
20 randomly selected documents across 4 annotators was measured with Fleiss’ kappa. Before
the discussion, the agreement was κ = 0.58 for relevance and κ = 0.55 for stance assessment
(both indicate a moderate agreement). After discussing discrepancies, similar to Task 1, each
annotator labeled a disjoint set of topics. We pooled the top-5 documents from each submitted
run (plus an additional baseline) and labeled 718 documents in total.</p>
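<p>Top-k pooling as applied here (top-5 per run) can be sketched as a simple union over runs; this mirrors what libraries such as TrecTools implement, but is not their code:</p>

```python
def top_k_pool(runs, k=5):
    """runs: mapping run_name -> {topic_id: [doc_ids in rank order]}.
    Returns topic_id -> set of documents to judge (union of each run's top k)."""
    pool = {}
    for ranking_by_topic in runs.values():
        for topic, ranked_docs in ranking_by_topic.items():
            pool.setdefault(topic, set()).update(ranked_docs[:k])
    return pool
```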
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Submitted Approaches and Evaluation Results</title>
        <p>
          One team He-Man [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ] participated in Task 2 and submitted three runs. Like the baseline run
Puss in Boots, all three participant runs used ChatNoir [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] for first-stage retrieval. For two runs,
first, the cause and effect events were extracted from the topic title field using dependency tree
parsing. Next, query expansion and query reformulation approaches were applied. In the query
expansion approach, the topic title was expanded with semantically related concepts from the
CauseNet, a graph of causal relations [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ]. For this, all relations in the CauseNet-Precision
variant were embedded using BERT [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ]. Next, the embedding’s cosine similarity was compared
with the embedding of the topic’s relation. The top-5 terms from the documents linked to the
matched CauseNet relation were then used to expand the query. The second approach, the
query reformulation technique, fed the deconstructed topic title in a semi-structured JSON
format to ChatGPT. The chatbot was then prompted to generate new query variants, exchanging
causes, effects, and causal phrases. All three query variants (original topic title, expanded query,
and reformulated query) were then submitted to ChatNoir. Finally, all approaches re-ranked
the results using a position bias. Documents containing the causal relationship from the topic
earlier in the document were ranked higher. To detect the position of the relation, the same
dependency tree parsing developed for the query deconstruction was used.
        </p>
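<p>The position-bias re-ranking (documents mentioning the topic’s causal relation earlier are ranked higher) can be sketched as follows; the substring-based relation detector is a simple stand-in for the team’s dependency-tree parsing:</p>

```python
def causal_rerank(docs, cause, effect):
    """docs: list of (doc_id, text) in initial retrieval order.
    Rank documents by the earliest character offset at which both the cause
    and the effect concept have appeared; documents never mentioning both
    concepts are ranked last."""
    def first_joint_position(text):
        low = text.lower()
        c, e = low.find(cause.lower()), low.find(effect.lower())
        if c == -1 or e == -1:
            return float("inf")
        return max(c, e)  # offset where the second of the two concepts appears
    return sorted(docs, key=lambda d: first_joint_position(d[1]))
```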
        <p>
          The task’s baseline run of Puss in Boots additionally predicted the document stance by first
summarizing a document’s main content with BART [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ],10 and then zero-shot prompting the
Flan-T5 model [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ].11
        </p>
        <p>Table 4 shows the evaluation results for Task 2 (more detailed results for each submitted
run, including the 95% confidence intervals, are in Table 12 in Appendix B). We report nDCG@5
for relevance-based retrieval effectiveness and macro-averaged F1 for stance detection. The
Puss in Boots baseline was more effective in terms of relevance than the two participant runs
that used query expansion. However, the participant run that only applied re-ranking
statistically significantly outperformed the baseline. This suggests that the participants’ query
expansion techniques degrade the first-stage retrieval results, and the re-ranking approach
applied afterward cannot compensate for the substantially worse performance of the query
expansion. The participating team opted not to detect the stance. Therefore, only the baseline
run could be evaluated for stance detection, achieving an F1-score of 0.256.
10 Pre-trained model: https://huggingface.co/facebook/bart-large-cnn; minimum length: 64; maximum length: 256.
11 Pre-trained model: https://huggingface.co/google/flan-t5-base; maximum generated tokens: 3; the prompt is given
in Appendix A.</p>
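<p>For reference, nDCG@5 over the graded relevance labels (0, 1, 2) can be computed as in this sketch, using the common log2 discount; the lab’s actual evaluation tooling may differ in details such as gain transformation:</p>

```python
from math import log2

def ndcg_at_k(gains, ideal_gains, k=5):
    """gains: relevance labels of the retrieved documents in rank order.
    ideal_gains: all relevance labels judged for the topic (any order)."""
    def dcg(labels):
        # Discounted cumulative gain over the top k positions.
        return sum(g / log2(i + 2) for i, g in enumerate(labels[:k]))
    ideal = dcg(sorted(ideal_gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0
```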
        <p>We additionally investigate whether the retrieval approaches correctly handle the causal
direction of queries. We, therefore, chose 5 of the 50 topics to be the inverse direction of an
already existing topic. Four of these topic pairs are realistic scenarios, e.g., ‘Can depression
lead to a lack of sleep?’ and ‘Can a lack of sleep lead to depression?’. The final pair contains a
somewhat unrealistic and challenging scenario: ‘Can earthquakes cause tsunamis?’ and ‘Can
tsunamis cause earthquakes?’ (i.e., is it feasible that a giant tsunami causes an earthquake?).
Table 5 lists the evaluation results split by topic type. For topic pairs, we report the
macro-averaged arithmetic and harmonic mean. The arithmetic mean shows overall effectiveness. The
harmonic mean reveals if the approaches are equally effective for both directions.</p>
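<p>The two aggregates differ exactly when effectiveness is asymmetric across the two causal directions, which is what the harmonic mean is meant to expose:</p>

```python
def pair_means(score_forward, score_backward):
    """Arithmetic and harmonic mean of a topic pair's effectiveness scores
    (e.g., nDCG@5). The harmonic mean collapses to 0.0 if either direction
    scores 0, exposing approaches that only handle one causal direction."""
    arithmetic = (score_forward + score_backward) / 2
    if score_forward == 0 or score_backward == 0:
        harmonic = 0.0
    else:
        harmonic = 2 * score_forward * score_backward / (score_forward + score_backward)
    return arithmetic, harmonic
```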
        <p>We find that the baseline run is substantially less effective on the inverted topics than on
the plain topics. The participant approach, which re-ranks according to the causal relation,
performs much better. Additionally, the substantial difference between the arithmetic and
harmonic mean for the inverse topics shows that the approaches are not equally effective for
both directions. Effectiveness for one of the directions is usually much higher than for the inverse
direction. Finally, none of the approaches retrieved a relevant document for the challenging
inverse topic, as revealed by the harmonic mean of 0.0.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Task 3: Image Retrieval for Arguments</title>
      <p>The goal of the third task was to provide argumentation support through image search. The
retrieval of relevant images should provide both a quick visual overview of frequent arguments
on some topic and compelling images to support one’s argumentation. To this end, the second
edition of this task continued with the retrieval of images that can be posted to indicate
either agreement or disagreement with some stance on a given topic. Images should be retrieved as
two separate lists, similar to a textual argument search (e.g., https://args.me).</p>
      <sec id="sec-5-1">
        <title>5.1. Task Definition</title>
        <p>
          Given a controversial topic and a collection of web documents with images, the task was to
retrieve for each stance (pro and con) images that indicate support for that stance. Participants
of Task 3 should retrieve and rank images, possibly utilizing the corresponding web documents,
from a focused crawl of 55,691 images and for a given set of 50 topics (which were used by other
tasks in previous years) [
          <xref ref-type="bibr" rid="ref28">28</xref>
          ]. Like in the last edition of this task, the focus is on providing users
with an overview of public opinions on controversial topics, for which we envision a system
that provides not only textual but also visual support for each stance in the form of images.
Participants were able to use the approximately 6,000 relevance judgments from the last edition
of the task for training supervised approaches [
          <xref ref-type="bibr" rid="ref29">29</xref>
          ].12 Similar to the other tasks, participants
were free to use any additional existing tools and datasets or develop their own.
        </p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Data Description</title>
        <p>Topics. Task 3 employed 50 controversial topics from earlier Touché editions (e.g., used in
2021), but which were not used in the first edition of this task. As for Task 1 (cf. Section 3), we
provided for each topic a title, description, and narrative. The description and narrative were
adapted as needed to fit the image retrieval setting.</p>
        <p>Document Collection. This task’s document collection stems from a focused crawl of
55,691 images and associated web pages from late 2022. We downloaded the top-100 images
and associated web pages from Google’s image search for 2,209 queries. Nearly half of the
queries (namely 1,050) were created like in the first edition of this task, by appending filter
words like “good,” “meme,” “stats,” “reasons,” or “effects” to a manually created query for each
topic. The remaining 1,159 queries were collected from participants in an open call, which
allowed anyone to submit queries until the end of December 2022. Of these queries, 557 were
created manually (57 by team Neville Longbottom, 250 by team Hikaru Sulu, and 250 by us), and
the remaining were created using ChatGPT by team Neville Longbottom: they asked ChatGPT
for a list of pro and con arguments for each topic, then for an image description illustrating the
respective arguments, and then for a search query to match the description. From the search
results we attempted to download 147,264 images, but discarded 5,666 for which we could not
download the image, 6,619 for which the image was more than 2,000 pixels wide or high,13
20,696 for which an initial text recognition using Tesseract14 yielded more than 20 words,15
8,538 for which the web page could not be downloaded, 484 for which the web page contained
no text, and 45,254 for which we could not find the image URL in the web page DOM. After a
duplicate detection using pHash,16 the final dataset contains 55,691 images. The dataset contains
various resources for each image, including the associated page for which it was retrieved as an
HTML page and as a detailed web archive,17 information on how Google ranked the image, and
12https://webis.de/data.html#touche-corpora
13As one use case for our task is getting a quick overview of arguments, we excluded overly large images.
14https://github.com/tesseract-ocr/tesseract
15To sharpen our focus on images, this year we tried to exclude images that are merely screenshots of text documents.
16https://www.phash.org/; same procedure as in the previous year
17Archived using https://github.com/webis-de/scriptor
information from Google’s Cloud Vision API,18 e.g., detected text and objects.</p>
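        <p>The filtering cascade above can be sketched as a single predicate. This is a minimal sketch: the record fields and helper values are assumptions about the crawl format, while the thresholds (2,000 pixels, 20 OCR words) are those stated in the text.</p>
        <p>
```python
# Sketch of the corpus filtering cascade; field names are assumptions.
def keep_image(record: dict) -> bool:
    """True if a crawled image survives all filtering steps from Section 5.2."""
    if record.get("image_bytes") is None:          # image could not be downloaded
        return False
    if record["width"] > 2000 or record["height"] > 2000:  # overly large image
        return False
    if record["ocr_word_count"] > 20:              # likely a screenshot of text
        return False
    if record.get("page_html") is None:            # web page could not be downloaded
        return False
    if not record.get("page_text"):                # web page contains no text
        return False
    if not record.get("image_url_in_dom", False):  # image URL not found in the DOM
        return False
    return True

sample = {"image_bytes": b"png", "width": 800, "height": 600,
          "ocr_word_count": 5, "page_html": "html", "page_text": "some text",
          "image_url_in_dom": True}
```
        </p>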
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Evaluation Setup</title>
        <p>Our two volunteer human assessors labeled the ranked results by the task participants (i.e.,
the images) for their relevance to the topic’s narrative. First, assessors decided whether an
image is on topic (yes or no). If so, they also decided whether an image is relevant according to
the pro-side of the narrative, its con-side, or both: 0 (not relevant), 1 (relevant), and 2 (highly
relevant), though we did not distinguish between levels 1 and 2 in our evaluation. However,
assessors were instructed that an image could not be highly relevant for both pro and con, so
that the labels indicate a preference. We provided the assessors with guidelines, discussed several
examples, and discussed edge cases as they came up. The achieved Fleiss’ κ values (measured
on three topics for which all assessors labeled all images) were 0.38 for on-topic (fair), 0.34 for
pro (fair), and 0.31 for con (fair). Without distinguishing levels 1 and 2, the agreement increases
to 0.45 for pro (moderate) and 0.52 for con (moderate). Our human assessors labeled 6,692 images in total.</p>
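        <p>For reference, Fleiss’ κ on such an item-by-category count table can be computed as follows; this is a minimal sketch of the standard formula, not the exact tooling used for the numbers above.</p>
        <p>
```python
# Minimal Fleiss' kappa: ratings[i][j] counts how many assessors assigned
# item i to category j (every item is rated by the same number of assessors).
def fleiss_kappa(ratings):
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    p_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in ratings]                       # per-item agreement
    p_j = [sum(row[j] for row in ratings) / (n_items * n_raters)
           for j in range(len(ratings[0]))]          # category proportions
    p_bar = sum(p_i) / n_items                       # observed agreement
    p_e = sum(p * p for p in p_j)                    # chance agreement
    return (p_bar - p_e) / (1 - p_e)
```
        </p>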
        <p>
          Although rank-based metrics for single image grids exist [
          <xref ref-type="bibr" rid="ref30">30</xref>
          ], none have been proposed so
far for a ‘pro-con’ layout. Therefore, participants’ submitted results were evaluated by the ratio
of relevant images among 20 retrieved images, namely 10 images per stance (precision@10).
We again used three increasingly strict definitions of relevance, corresponding to three
precision@10 evaluation measures: being on-topic, being in support of some stance (i.e., an image is
“argumentative”), and being in support of the stance for which the image was retrieved.
        </p>
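        <p>The three increasingly strict precision@10 variants can be sketched as follows. The judgment record layout (on_topic, pro, con fields) is an assumption for illustration; the sketch divides by however many images were retrieved rather than fixing it at 20.</p>
        <p>
```python
# Sketch of the three precision@10 variants: on-topic, argumentative,
# and correct-stance relevance, over 10 images per stance side.
def precision_at_10(pro_images, con_images, judgments):
    retrieved = [(img, "pro") for img in pro_images[:10]] + \
                [(img, "con") for img in con_images[:10]]
    on_topic = argumentative = stance = 0
    for img, side in retrieved:
        j = judgments.get(img)
        if j is None or not j["on_topic"]:
            continue                               # off-topic or unjudged
        on_topic += 1
        if j["pro"] > 0 or j["con"] > 0:           # supports some stance
            argumentative += 1
        if j[side] > 0:                            # supports the retrieved stance
            stance += 1
    n = len(retrieved)
    return on_topic / n, argumentative / n, stance / n

scores = precision_at_10(
    ["a", "b"], ["b", "c"],
    {"a": {"on_topic": True, "pro": 1, "con": 0},
     "b": {"on_topic": True, "pro": 0, "con": 2},
     "c": {"on_topic": False, "pro": 0, "con": 0}})
```
        </p>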
      </sec>
      <sec id="sec-5-4">
        <title>5.4. Submitted Approaches and Evaluation Results</title>
        <p>In total, three teams participated in Task 3 and submitted 12 runs, not counting the
submitted queries described above. Table 6 shows the results of all submitted runs (more detailed
results for each submitted run, including the 95% confidence intervals, are in Tables 13, 14,
and 15 in Appendix B). Overall, scores are considerably lower than last year, where precision@10
for stance relevance was as high as 0.425. We attribute this to the new set of topics, which
contained many more questions that were hard to picture.</p>
        <p>
          As a baseline (team Minsc), we used the model of [
          <xref ref-type="bibr" rid="ref31">31</xref>
          ], which was developed by a collaboration
of two teams that participated in last year’s task: Aramis and Boromir.19 The approach employed
standard retrieval and a set of handcrafted features for argumentativeness detection. For
retrieval, the approach used Elasticsearch’s BM25 (default settings: k1=1.2 and b=0.75) with
each image (document) represented by the text from the web page around the image and text
recognized in the image using Tesseract.14 For argumentativeness detection, the approach used
a neural network classifier based on thirteen different features (color properties, image type,
and textual features), and trained on the ground-truth annotations from last year. The features
are calculated from, amongst others, the query, the image text, the HTML text around the
image, the interrelation and sentiments of the mentioned texts, and the colors in the image.
18https://cloud.google.com/vision
19Since no stance model convincingly outperformed naive baselines in their evaluation, we use the simple both-sides
baseline that assigns each image to both stances.
The approach used random stance assignment. Since this baseline performed much worse than
anticipated, we expect a bug in the implementation.
        </p>
        <p>
          Team Hikaru Sulu submitted two valid runs. Their approach used CLIP [
          <xref ref-type="bibr" rid="ref32">32</xref>
          ] to calculate the
similarity between keywords and images and retrieved, per topic, the images most similar to
one of the keywords. For the first run, they used the topic title as a keyword, but for the second
run, they extracted all nouns and verbs from the topic title and extended that list with synonyms
and antonyms from WordNet [
          <xref ref-type="bibr" rid="ref33">33</xref>
          ]. The stance was determined randomly, which performed better
in their internal evaluation than using different keywords for pro and con. As Table 6
shows, the extended list led to retrieving more on-topic images, but fewer argumentative ones.
        </p>
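        <p>The core of this CLIP-based ranking reduces to a maximum cosine similarity between keyword and image embeddings. In the sketch below the embeddings are stubbed as plain vectors; in the actual runs they would come from CLIP’s text and image encoders.</p>
        <p>
```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_images(keyword_vecs, image_vecs):
    """Rank image ids by their maximum similarity to any keyword, best first."""
    score = {img: max(cosine(k, v) for k in keyword_vecs)
             for img, v in image_vecs.items()}
    return sorted(score, key=score.get, reverse=True)

ranking = rank_images([[1.0, 0.0]], {"a": [1.0, 0.1], "b": [0.0, 1.0]})
```
        </p>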
        <p>
          Team Jean-Luc Picard [
          <xref ref-type="bibr" rid="ref34">34</xref>
          ] submitted five valid runs. Their first run used the web page text
indexed by PyTerrier’s BM25 [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] (default settings: k1=1.2 and b=0.75). For the other runs,
they used a pipeline of query preprocessing, the same BM25-based retrieval as their first run,
stance detection, and re-ranking. For query preprocessing, they created a parse tree of the topic
and filtered out frequent words to create a short query. The runs correspond to four different
stance detection approaches: (1) random or (2) using a zero-shot classification based on the
pre-trained BART MultiNLI model20 that assigns the image to pro, contra, or neutral (i.e., will be
discarded) based on the (a) web page text, (b) the image text, or (c) both texts. After that, images
were re-ranked: for each topic, images were generated with Stable Diffusion [
          <xref ref-type="bibr" rid="ref35">35</xref>
          ] using the
preprocessed query as prompt, then SIFT keypoints were identified [
          <xref ref-type="bibr" rid="ref36">36</xref>
          ] in both the retrieved and the
generated image and matched between the two, and then the result list was re-ranked
by the number of matched keypoints in descending order. Similar to the internal evaluation
of team Hikaru Sulu, a random stance assignment performed best.
20https://huggingface.co/facebook/bart-large-mnli
        </p>
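        <p>The final re-ranking step can be sketched as follows; the keypoint match counts are placeholder numbers standing in for the output of SIFT detection and matching (e.g., via OpenCV).</p>
        <p>
```python
# Re-rank a result list by the number of SIFT keypoints matched against the
# image generated for the query, in descending order. Python's stable sort
# keeps the original (BM25) order for ties; counts here are placeholders.
def rerank_by_matches(result_list, match_counts):
    return sorted(result_list, key=lambda img: match_counts.get(img, 0),
                  reverse=True)

ranking = rerank_by_matches(["img1", "img2", "img3"],
                            {"img1": 2, "img2": 7, "img3": 2})
```
        </p>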
        <p>
          Team Neville Longbottom [
          <xref ref-type="bibr" rid="ref37">37</xref>
          ] submitted five valid runs. They first employed ChatGPT21
to generate image descriptions for each topic and stance (neither description nor narrative
was used). Then, they retrieved images with these descriptions, either (1) using the web page
text close to the image indexed via PyTerrier’s BM25 [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] (default settings: k1=1.2 and b=0.75)
or (2) using CLIP [
          <xref ref-type="bibr" rid="ref32">32</xref>
          ] for ranking images by their similarity to the description. For runs 3–5,
the approach continued by re-ranking the result list, either (a) by penalizing the BM25-score
of an image with the BM25-score of the image for the respective other stance’s description
(re-ranking the results of run (1)) or (b) by using IBM’s debater pro-con score [
          <xref ref-type="bibr" rid="ref38">38</xref>
          ] between
the topic title and the text close to the image on the web page (2 runs; re-ranking results of
run (1) or (2)). The CLIP method without re-ranking performed best.
        </p>
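        <p>Re-ranking variant (a) amounts to a score difference between the two stances’ descriptions; the BM25 scores below are illustrative numbers, not actual retrieval output.</p>
        <p>
```python
# Penalize each image's BM25 score for one stance's description with its
# BM25 score for the other stance's description (shown here for the pro side).
def stance_diff_scores(pro_bm25, con_bm25):
    images = set(pro_bm25) | set(con_bm25)
    return {img: pro_bm25.get(img, 0.0) - con_bm25.get(img, 0.0)
            for img in images}

diff = stance_diff_scores({"a": 3.0, "b": 2.0}, {"b": 2.5, "c": 1.0})
```
        </p>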
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Task 4: Multilingual Multi-Target Stance Classification</title>
      <p>In this edition of the Touché lab, we proposed a new task on multilingual multi-target stance
classification of comments to proposals coming from an online participatory democracy platform.
The goal of the fourth task was to build technologies that help analyze opinions on a wide range
of socially important topics. Large-scale deployment of such technologies faces challenges like
multilingualism or high variability of the topics of interest and hence is the target of this task.</p>
      <sec id="sec-6-1">
        <title>6.1. Task Definition</title>
        <p>Given a proposal on a socially important issue, its title, and its topic, the task was to classify
whether a comment on the proposal is ‘in favor’, ‘against’, or ‘neutral’ towards the commented
proposal. The participants needed to classify multilingual comments written in 6 different
languages22 into the 3 stance classes. Comments on the proposals could be written in a different
language than the proposal itself, and multiple comments could target the same proposal.</p>
        <p>
          Within the task, we organized two subtasks: (1) Cross-debate Classification, where the
participants were not allowed to train on comments on proposals that also had comments
in the test set, and (2) All-data-available Classification, where the participants could use all
the available data. Also, the participants could use any additional existing tools or previously
published datasets like Debating Europe [
          <xref ref-type="bibr" rid="ref39">39</xref>
          ] or X-Stance [40].
        </p>
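        <p>The cross-debate constraint of Subtask 1 can be sketched as a simple filter over the training data; the proposal_id field is an assumption about the data layout.</p>
        <p>
```python
# Subtask 1: remove every training comment whose proposal also has
# comments in the test set. Field names are assumptions for illustration.
def cross_debate_train(train_comments, test_comments):
    test_proposals = {c["proposal_id"] for c in test_comments}
    return [c for c in train_comments
            if c["proposal_id"] not in test_proposals]

train = [{"proposal_id": 1, "text": "agree"},
         {"proposal_id": 2, "text": "disagree"}]
test = [{"proposal_id": 2, "text": "maybe"}]
```
        </p>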
      </sec>
      <sec id="sec-6-2">
        <title>6.2. Data Description</title>
        <p>The proposals and comments used in Task 4 stem from the Conference on the Future of
Europe (CoFE),23 an online debating platform where users can write proposals and comment
on the suggested ideas. The initially obtained dataset was comprised of 4,247 proposals and
20,102 comments written in 26 languages (24 official languages of the European Union plus
Catalan and Esperanto) [41, 42]. As shown in Figure 1, English, German, and French were the
most commonly used languages on the platform. An example of a proposal, a corresponding
comment, and the stance of the comment is shown in Table 7.24
21https://chat.openai.com/chat
22English, French, German, Greek, Hungarian, and Italian.
23https://futureu.europa.eu</p>
        <p>For developing stance classifiers, participants were provided with three datasets: (1) CFU: a
large set of unlabeled comment–proposal pairs, (2) CFS: a large set of comment–proposal
pairs where comment authors selected either an ‘in favor’ or ‘against’ stance (no ‘neutral’ label
was available for selection), and (3) CFE-D: a smaller set of comment–proposal pairs manually
annotated by expert native speakers with three stance labels. A fourth dataset, CFE-T, was also
labeled by experts and was used to evaluate the submitted approaches (see Table 8). The dataset
contained texts written in the 6 most common languages, omitting Spanish (see Figure 1). For
labeling the CFE-D and CFE-T datasets, untranslated comments and English translations of the
proposals—to better understand the context—were used.</p>
      </sec>
      <sec id="sec-6-3">
        <title>6.3. Submitted Approaches and Evaluation Results</title>
        <p>Two teams participated in Task 4 and submitted 8 runs in total. Below, we briefly describe the
participants’ approaches plus additional baseline runs.</p>
        <p>
          Team Cavalier was our baseline that implemented three stance classifiers. For Subtask 1
(cross-debate classification), we implemented two baseline classifiers: The first one (Cavalier
Simple) simply always predicts the majority class (‘in favor’). The second baseline (Cavalier)
is based on the transformer-based multilingual masked language model XLM-R [43, 42]. This
model was first fine-tuned on the X-Stance dataset [40] and the CF dataset to classify just two
stance classes (‘in favor’ or ‘against’) and subsequently fine-tuned again on the Debating Europe
dataset [
          <xref ref-type="bibr" rid="ref39">39</xref>
          ] to classify all three stance classes (‘in favor’, ‘against’, or ‘neutral’). All comments
on proposals appearing in the test set CFE-T were removed before fine-tuning. The baseline
classifier for Subtask 2 (all-data-available classification) used the same model and analogous
training steps as for Subtask 1, including comments on proposals that appeared in the test set.
        </p>
        <p>
          Team Silver Surfer [44] submitted six valid runs to Subtask 2. Their stance classifiers were
based on fine-tuning pre-trained English and multilingual transformer models: a RoBERTa
24From https://futureu.europa.eu/en/processes/GreenDeal/f/1/proposals/83.
model [45],25 an XLM-R model [43],26 and two BERT models [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ].27 To increase the size of the
training data, the team applied data augmentation using back-translation (i.e., translating texts
to other languages and then back to the original language) [46] and used label spreading [47] to
transfer labels from the CFE-D dataset to the CFU dataset. The team first fine-tuned a RoBERTa
model (Run 2, comments translated to English) and an XLM-R model (Run 3, no translation) on
the CFS dataset as well as on the CFU dataset after applying label spreading. Run 4 used the
CFE-D dataset after data augmentation using back-translation to fine-tune an XLM-R model.
For Run 5, the team fine-tuned a RoBERTa model on the comments from the CFE-D dataset,
translating all comments to English. The team’s Run 6 used a two-step training approach,
where they first fine-tuned an English BERT model on binary stance classification based on the
translated comments from the CFS dataset and subsequently fine-tuned the model to classify all
three stance classes on translated comments from the CFE-D dataset. Finally, Team Silver Surfer
combined comment metadata features (e.g., number of upvotes/downvotes, endorsements) and
the output probabilities from six fine-tuned transformer models in an XGBoost classifier (Run 1):
(1) RoBERTa fine-tuned on the CFE-D dataset (comments translated to English, same as Run 5),
(2) XLM-R fine-tuned on the CFE-D dataset (no translation), (3) RoBERTa fine-tuned on the
25https://huggingface.co/roberta-base
26https://huggingface.co/xlm-roberta-large
27https://huggingface.co/bert-base-uncased and https://huggingface.co/bert-base-multilingual-uncased
CFS dataset (translation to English), (4) XLM-R fine-tuned on the CFS dataset (no translation),
(5) English BERT fine-tuned on the CFS and CFE-D datasets (two-step fine-tuning, comments
translated to English, same as Run 6), and (6) multilingual BERT fine-tuned on the CFS and
CFE-D datasets (two-step fine-tuning, no translation, analogous to Run 6).
        </p>
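        <p>As a much-simplified stand-in for label spreading [47], the sketch below assigns each unlabeled example the label of its nearest labeled neighbor in embedding space; the real method iteratively propagates soft labels over a similarity graph, but the direction of transfer (from the labeled CFE-D pairs to the unlabeled CFU pairs) is the same.</p>
        <p>
```python
import math

# Toy 1-nearest-neighbor label transfer: a simplified stand-in for label
# spreading from labeled (CFE-D) to unlabeled (CFU) comment-proposal pairs.
def propagate_labels(labeled, unlabeled):
    """labeled: list of (vector, label); unlabeled: list of vectors."""
    return [min(labeled, key=lambda pair: math.dist(pair[0], x))[1]
            for x in unlabeled]

labels = propagate_labels([([0.0, 0.0], "against"), ([1.0, 1.0], "in favor")],
                          [[0.1, 0.0], [0.9, 1.0]])
```
        </p>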
        <p>
          Team Queen of Swords [48] submitted two valid runs that were trained in a two-step
fine-tuning setting on a combination of the labeled (CFS and CFE-D) and unlabeled (CFU) datasets.
To derive labels for the CFU dataset, the team first fine-tuned a multilingual BERT model [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ]28
only on the CFS and CFE-D datasets and used the fine-tuned model to predict labels on the
CFU dataset. Their final BERT-based classifier was then again fine-tuned on the predicted labels
for the CFU dataset (only the comment–proposal pairs whose labels were predicted above a
certain probability were used) and the ground-truth labels from the CFS and CFE-D datasets.
The team submitted their best configuration (probability threshold: 0.9) for Subtask 1 and used
the same hyperparameters to fine-tune a BERT model on the larger dataset of Subtask 2.
        </p>
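        <p>The pseudo-labeling step can be sketched as a threshold filter over model predictions; the prediction tuple layout is an assumption for illustration.</p>
        <p>
```python
# Keep only comment-proposal pairs whose predicted stance probability
# exceeds the threshold (0.9 in the team's best configuration).
def confident_pseudo_labels(predictions, threshold=0.9):
    """predictions: list of (pair_id, label, probability) tuples."""
    return [(pair_id, label) for pair_id, label, prob in predictions
            if prob > threshold]

kept = confident_pseudo_labels([("p1", "in favor", 0.95),
                                ("p2", "neutral", 0.60),
                                ("p3", "against", 0.91)])
```
        </p>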
        <p>The submitted approaches were evaluated using macro-averaged F1-scores (to account for
the class imbalance; see CFE-T in Table 8) and accuracy. Table 9 shows the evaluation results
per language and across all languages in the test set. None of the submitted participant runs
outperformed the baseline (Cavalier) in both subtasks.</p>
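        <p>Macro-averaged F1 treats all three classes equally, which is why it complements accuracy under class imbalance. The minimal sketch below (with illustrative label counts) shows how a majority-class predictor, like the Cavalier Simple baseline, can reach high accuracy while its macro-F1 stays low.</p>
        <p>
```python
# Macro-averaged F1: compute F1 per class, then average over the classes.
def macro_f1(y_true, y_pred, classes):
    f1s = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

# Illustrative imbalance: a majority-class predictor gets 0.8 accuracy
# but only about 0.30 macro-F1 (the two minority classes score 0).
truth = ["favor"] * 8 + ["against"] + ["neutral"]
major = ["favor"] * 10
acc = sum(t == p for t, p in zip(truth, major)) / len(truth)
```
        </p>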
        <p>Hungarian, which is the most morphosyntactically distant from the other languages, was the
most challenging language for the baselines. Conversely, the participants’ classifiers were
28https://huggingface.co/bert-base-multilingual-cased
least effective for the German language and did not consistently struggle with Hungarian.
Interestingly, the Cavalier baseline for Subtask 2 yielded better scores for Italian comments,
even though most of the other runs performed better on English comments. However, we could
not observe patterns regarding the use of multilingual transformer models or English models
with translation before classification. Both approaches seemed to work equally well.</p>
        <p>The best runs of Subtask 2 (our baseline and Silver Surfer Run 6) used a two-step fine-tuning
setting, where the model was first trained to learn binary stance classification and subsequently
was fine-tuned on three stance labels (including ‘neutral’). These results indicate that breaking
down stance classification into several steps can improve its effectiveness.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion</title>
      <p>The fourth edition of the Touché lab featured four tasks: (1) argument retrieval for controversial
topics, (2) causal retrieval, (3) image retrieval for arguments, and (4) multilingual multi-target
stance classification. In contrast to the prior iterations of the Touché lab, the main challenge
for the participants was to apply argument analysis methodology on long web documents.
Furthermore, we expanded the lab’s scope by introducing new tasks on evidence retrieval for
causal relationships and on predicting the stance of multilingual texts.</p>
      <p>Overall, 7 teams participated in the tasks and submitted a total of 30 runs. The participants
often used approaches that were effective in previous Touché editions, like sparse retrieval for
an initial result set that then is re-ranked based on argument quality estimation and stance
prediction. This year, many also used generative language models like ChatGPT as classifiers
with various prompt-engineering techniques.</p>
      <p>For Tasks 1 and 2, the teams used ChatNoir as their first-stage retrieval system and then
re-ranked documents based on the predicted argument quality and stance (Task 1) or based
on the presence of causal relationships (Task 2). Both re-ranking ideas improved the retrieval
effectiveness compared to the first-stage retrieval results. For Task 3, the four most effective
runs all employed CLIP embeddings to find images that are similar to some text, which means
dense retrieval approaches outperformed traditional approaches this year. However, none of
the systems could predict an image’s stance better than random guessing. To classify the stance
of multilingual texts (Task 4), the participants used BERT-based models, and the most successful
runs employed a two-step fine-tuning: first, using binary stance labels and then learning the
‘neutral’ class. Overall, stance prediction remained the hardest task across all four tasks.</p>
      <p>As the number of active teams substantially decreased in the fourth edition of Touché (7 active
teams in 2023 compared to 23 in 2022, 27 in 2021, and 17 in 2020), we decided to pause the
argument and causal retrieval tasks for now. Still, to support researchers working on argument
or causal retrieval, all Touché resources will remain freely available, including the topics, the
manual judgments (relevance, argument quality, stance), and the runs from the teams.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>This work has been partially supported by the Deutsche Forschungsgemeinschaft (DFG) in the
project “ACQuA 2.0: Answering Comparative Questions with Arguments” (project 376430233) as
part of the priority program “RATIO: Robust Argumentation Machines” (SPP 1999). V. Barriere’s
work was funded by the National Center for Artificial Intelligence CENIA FB210017, Basal
ANID. This work has been partially supported by the OpenWebSearch.eu project (funded by
the EU; GA 101070014).</p>
      <p>classification dataset of online debates, in: Proceedings of PoliticalNLP 2022, ELRA, 2022,
pp. 16–21. URL: https://aclanthology.org/2022.politicalnlp-1.3.
[40] J. Vamvas, R. Sennrich, X-stance: A multilingual multi-target dataset for stance detection,
in: Proceedings of SwissText/KONVENS 2020, CEUR-WS.org, 2020. URL:
https://ceur-ws.org/Vol-2624/paper9.pdf.
[41] V. Barriere, G. Jacquet, L. Hemamou, CoFE: A new dataset of intra-multilingual
multi-target stance classification from an online European participatory democracy platform, in:
Proceedings of AACL-IJCNLP 2022, 2022, pp. 418–422.
[42] V. Barriere, A. Balahur, Multilingual multi-target stance recognition in online public
consultations, Mathematics 11 (2023) 2161.
[43] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave,
M. Ott, L. Zettlemoyer, V. Stoyanov, Unsupervised cross-lingual representation learning
at scale, in: Proceedings of ACL 2020, ACL, 2020, pp. 8440–8451. doi:10.18653/v1/
2020.acl-main.747.
[44] J. P. Avila, A. Rodrigo, R. Centeno, Silver Surfer team at Touché task 4: Testing data
augmentation and label propagation for multilingual stance detection, in: Working Notes
of CLEF 2023, CEUR Workshop Proceedings, CEUR-WS.org, 2023.
[45] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer,
V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, arXiv (2019).
doi:10.48550/arXiv.1907.11692.
[46] A. Sugiyama, N. Yoshinaga, Data augmentation using back-translation for context-aware
neural machine translation, in: Proceedings of DiscoMT@EMNLP 2019, ACL, 2019, pp.
35–44. doi:10.18653/v1/D19-6504.
[47] D. Zhou, O. Bousquet, T. N. Lal, J. Weston, B. Schölkopf, Learning with local and global
consistency, in: Proceedings of NIPS 2003, MIT Press, 2003, pp. 321–328. URL:
https://proceedings.neurips.cc/paper/2003/hash/87682805257e619d49b8e0dfdc14afa-Abstract.html.
[48] K. Schaefer, Queen of Swords at Touché 2023: Intra-multilingual multi-target stance
classification using BERT, in: Working Notes of CLEF 2023, CEUR Workshop Proceedings,
CEUR-WS.org, 2023.</p>
    </sec>
    <sec id="sec-9">
      <title>A. Zero-shot Prompts</title>
      <p>The zero-shot prompts used for the stance prediction baselines are given in Listing 1 (for Task 1,
see Section 3) and in Listing 2 (for Task 2, see Section 4).</p>
      <p>Given a query, predict the stance of a given text. The stance should be one of the following
four labels:
PRO: The text contains opinions or arguments in favor of the query "&lt;query&gt;".
CON: The text contains opinions or arguments against the query "&lt;query&gt;".
NEU: The text contains as many arguments in favor of as it contains against the query "&lt;query&gt;".
UNK: The text is not relevant to the query "&lt;query&gt;", or it only contains factual information.
Text: &lt;summary&gt;</p>
      <p>Listing 1: Zero-shot prompt to predict the stance of a document towards a query (Task 1). The
placeholder &lt;query&gt; is replaced by the topic title, and &lt;summary&gt; by a short summary of
the retrieved document’s text. The UNK label is mapped to NO.</p>
      <p>Given a query, predict the stance of a given text. The stance should be one of the following
four labels:
SUP: According to the text, &lt;cause&gt; causes &lt;effect&gt;.
REF: According to the text, &lt;cause&gt; does not cause &lt;effect&gt;.
UNK: The text is not relevant to &lt;cause&gt; and &lt;effect&gt;.
Text: &lt;summary&gt;</p>
      <p>Listing 2: Zero-shot prompt to predict the causal stance of a document towards a query (Task 2).
The placeholders &lt;cause&gt; and &lt;effect&gt; are replaced with the query’s cause and effect
entities, and &lt;summary&gt; with a short summary of the retrieved document’s text. The UNK
label is mapped to NO. The NEU label is not considered in the prompt.</p>
    </sec>
    <sec id="sec-10">
      <title>B. Full Evaluation Results of Touché 2023: Argument and Causal Retrieval</title>
      <p>[Tables 13–15: full per-run evaluation results for Task 3, listing each team’s runs (Topic-title, Keywords, clip_chatgpt_args.raw, clip_chatgpt_args.debater, bm25_chatgpt_args.raw, bm25_chatgpt_args.dif, bm25_chatgpt_args.debater, the BM25 Baseline, and Aramis) with their stance detection variants.]</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bondarenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fröbe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kiesel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Schlatt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Barriere</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ravenet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Hemamou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Luck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Reimer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hagen</surname>
          </string-name>
          , Overview of Touché 2023:
          <article-title>Argument and causal retrieval</article-title>
          ,
          <source>in: Proceedings of CLEF 2023, Lecture Notes in Computer Science</source>
          , Springer, Berlin Heidelberg New York,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>I. Ajzen</surname>
          </string-name>
          ,
          <article-title>The social psychology of decision making, in: Social Psychology: Handbook of Basic Principles</article-title>
          , Guilford Press,
          <year>1996</year>
          , pp.
          <fpage>297</fpage>
          -
          <lpage>325</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S. E.</given-names>
            <surname>Robertson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Walker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hancock-Beaulieu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gatford</surname>
          </string-name>
          , Okapi at TREC-3, in:
          <source>Proceedings of TREC</source>
          <year>1994</year>
          , volume
          <volume>500</volume>
          -225 of NIST Special Publication,
          <string-name>
            <surname>NIST</surname>
          </string-name>
          ,
          <year>1994</year>
          , pp.
          <fpage>109</fpage>
          -
          <lpage>126</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S. E.</given-names>
            <surname>Robertson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zaragoza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Taylor</surname>
          </string-name>
          ,
          <article-title>Simple BM25 extension to multiple weighted fields</article-title>
          ,
          <source>in: Proceedings of CIKM</source>
          <year>2004</year>
          , ACM,
          <year>2004</year>
          , pp.
          <fpage>42</fpage>
          -
          <lpage>49</lpage>
          . doi:10.1145/1031171.1031181.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>N.</given-names>
            <surname>Thakur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Reimers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rücklé</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Srivastava</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Gurevych</surname>
          </string-name>
          ,
          <article-title>BEIR: A heterogeneous benchmark for zero-shot evaluation of information retrieval models</article-title>
          ,
          <source>in: Proceedings of NeurIPS</source>
          <year>2021</year>
          , NeurIPS,
          <year>2021</year>
          . URL: https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/hash/65b9eea6e1cc6bb9f0cd2a47751a186f-Abstract-round2.html.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>MacAvaney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Yates</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Feldman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Downey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Cohan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goharian</surname>
          </string-name>
          ,
          <article-title>Simplified data wrangling with ir_datasets</article-title>
          ,
          <source>in: Proceedings of SIGIR</source>
          <year>2021</year>
          , ACM,
          <year>2021</year>
          , pp.
          <fpage>2429</fpage>
          -
          <lpage>2436</lpage>
          . doi:10.1145/3404835.3463254.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Fröbe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wiegmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kolyada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Grahm</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Elstner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Loebe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hagen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <article-title>Continuous integration for reproducible shared tasks with TIRA.io</article-title>
          ,
          <source>in: Proceedings of ECIR 2023, Lecture Notes in Computer Science</source>
          , Springer,
          <year>2023</year>
          , pp.
          <fpage>236</fpage>
          -
          <lpage>241</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bondarenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fröbe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Beloucif</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Gienapp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ajjour</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Panchenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Biemann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wachsmuth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hagen</surname>
          </string-name>
          ,
          <article-title>Overview of Touché 2020: Argument retrieval</article-title>
          ,
          <source>in: Working Notes of CLEF</source>
          <year>2020</year>
          , volume
          <volume>2696</volume>
          <source>of CEUR Workshop Proceedings, CEUR-WS.org</source>
          ,
          <year>2020</year>
          . URL: https://ceur-ws.org/Vol-2696/paper_261.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bondarenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Gienapp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fröbe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Beloucif</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ajjour</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Panchenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Biemann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wachsmuth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hagen</surname>
          </string-name>
          ,
          <article-title>Overview of Touché 2021: Argument retrieval</article-title>
          ,
          <source>in: Working Notes of CLEF</source>
          <year>2021</year>
          , volume
          <volume>2936</volume>
          <source>of CEUR Workshop Proceedings, CEUR-WS.org</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>2258</fpage>
          -
          <lpage>2284</lpage>
          . URL: https://ceur-ws.org/Vol-2936/paper-205.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bondarenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fröbe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kiesel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Syed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Gurcke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Beloucif</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Panchenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Biemann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wachsmuth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hagen</surname>
          </string-name>
          ,
          <article-title>Overview of Touché 2022: Argument retrieval</article-title>
          ,
          <source>in: Working Notes of CLEF</source>
          <year>2022</year>
          , volume
          <volume>3180</volume>
          <source>of CEUR Workshop Proceedings, CEUR-WS.org</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>2867</fpage>
          -
          <lpage>2903</lpage>
          . URL: https://ceur-ws.org/Vol-3180/paper-247.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Overwijk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Callan</surname>
          </string-name>
          ,
          <article-title>ClueWeb22: 10 billion web documents with rich information</article-title>
          ,
          <source>in: Proceedings of SIGIR</source>
          <year>2022</year>
          , ACM,
          <year>2022</year>
          , pp.
          <fpage>3360</fpage>
          -
          <lpage>3362</lpage>
          . doi:10.1145/3477495.3536321.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>J.</given-names>
            <surname>Bevendorff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hagen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <article-title>Elastic ChatNoir: Search engine for the ClueWeb and the Common Crawl</article-title>
          ,
          <source>in: Proceedings of ECIR</source>
          <year>2018</year>
          , volume
          <volume>10772</volume>
          of Lecture Notes in Computer Science, Springer,
          <year>2018</year>
          , pp.
          <fpage>820</fpage>
          -
          <lpage>824</lpage>
          . doi:10.1007/978-3-319-76941-7_83
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>A.</given-names>
            <surname>Chernodub</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Oliynyk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Heidenreich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bondarenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hagen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Biemann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Panchenko</surname>
          </string-name>
          ,
          <article-title>TARGER: Neural argument mining at your fingertips</article-title>
          ,
          <source>in: Proceedings of ACL</source>
          <year>2019</year>
          , ACL,
          <year>2019</year>
          , pp.
          <fpage>195</fpage>
          -
          <lpage>200</lpage>
          . doi:10.18653/v1/p19-3031
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>C.</given-names>
            <surname>Macdonald</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tonellotto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>MacAvaney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Ounis</surname>
          </string-name>
          ,
          <article-title>PyTerrier: Declarative experimentation in Python from BM25 to dense retrieval</article-title>
          ,
          <source>in: Proceedings of CIKM</source>
          <year>2021</year>
          , ACM,
          <year>2021</year>
          , pp.
          <fpage>4526</fpage>
          -
          <lpage>4533</lpage>
          . doi:10.1145/3459637.3482013.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>H.</given-names>
            <surname>Wachsmuth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Naderi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Hou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bilu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Prabhakaran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. A.</given-names>
            <surname>Thijm</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Hirst</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <article-title>Computational argumentation quality assessment in natural language</article-title>
          ,
          <source>in: Proceedings of EACL</source>
          <year>2017</year>
          , ACL,
          <year>2017</year>
          , pp.
          <fpage>176</fpage>
          -
          <lpage>187</lpage>
          . doi:10.18653/v1/e17-1017
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>J. R. M.</given-names>
            <surname>Palotti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Scells</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Zuccon</surname>
          </string-name>
          ,
          <article-title>TrecTools: an open-source Python library for information retrieval practitioners involved in TREC-like campaigns</article-title>
          ,
          <source>in: Proceedings of SIGIR</source>
          <year>2019</year>
          , ACM,
          <year>2019</year>
          , pp.
          <fpage>1325</fpage>
          -
          <lpage>1328</lpage>
          . doi:10.1145/3331184.3331399.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>G. V.</given-names>
            <surname>Cormack</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. D.</given-names>
            <surname>Smucker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. L. A.</given-names>
            <surname>Clarke</surname>
          </string-name>
          ,
          <article-title>Efficient and effective spam filtering and re-ranking for large web datasets</article-title>
          ,
          <source>Information Retrieval Journal</source>
          <volume>14</volume>
          (
          <year>2011</year>
          )
          <fpage>441</fpage>
          -
          <lpage>465</lpage>
          . doi:10.1007/s10791-011-9162-z.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>H. W.</given-names>
            <surname>Chung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Hou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Longpre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zoph</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Fedus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dehghani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Brahma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Webson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. S.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Suzgun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chowdhery</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Narang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Mishra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. Y.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Petrov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. H.</given-names>
            <surname>Chi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dean</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Roberts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q. V.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <article-title>Scaling instruction-finetuned language models</article-title>
          ,
          <source>arXiv</source>
          (
          <year>2022</year>
          ). doi:10.48550/arXiv.2210.11416.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ghazvininejad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mohamed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <article-title>BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension</article-title>
          ,
          <source>in: Proceedings of ACL</source>
          <year>2020</year>
          , ACL,
          <year>2020</year>
          , pp.
          <fpage>7871</fpage>
          -
          <lpage>7880</lpage>
          . doi:10.18653/v1/2020.acl-main.703.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>M.</given-names>
            <surname>Plenz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Buchmüller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bondarenko</surname>
          </string-name>
          ,
          <article-title>Argument quality prediction for ranking documents</article-title>
          ,
          <source>in: Working Notes of CLEF</source>
          <year>2023</year>
          , CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>P.</given-names>
            <surname>Qi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bolton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Manning</surname>
          </string-name>
          ,
          <article-title>Stanza: A python natural language processing toolkit for many human languages</article-title>
          ,
          <source>in: Proceedings of ACL</source>
          <year>2020</year>
          , ACL,
          <year>2020</year>
          , pp.
          <fpage>101</fpage>
          -
          <lpage>108</lpage>
          . doi:10.18653/v1/2020.acl-demos.14.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>G.</given-names>
            <surname>Ke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Meng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Finley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Ye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>LightGBM: A highly efficient gradient boosting decision tree</article-title>
          ,
          <source>in: Proceedings of NeurIPS</source>
          <year>2017</year>
          , NeurIPS,
          <year>2017</year>
          , pp.
          <fpage>3146</fpage>
          -
          <lpage>3154</lpage>
          . URL: https://proceedings.neurips.cc/paper/2017/file/ 6449f44a102fde848669bdd9eb6b76fa-Paper.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>H.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kasai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ostendorf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yih</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. A.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <article-title>One embedder, any task: Instruction-finetuned text embeddings</article-title>
          ,
          <source>arXiv</source>
          (
          <year>2022</year>
          ). doi:10.48550/arXiv.2212.09741.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bondarenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wolska</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Heindorf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Blübaum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.-C. N.</given-names>
            <surname>Ngomo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Braslavski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hagen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <article-title>CausalQA: A benchmark for causal question answering</article-title>
          ,
          <source>in: Proceedings of COLING</source>
          <year>2022</year>
          , ICCL,
          <year>2022</year>
          , pp.
          <fpage>3296</fpage>
          -
          <lpage>3308</lpage>
          . URL: https://aclanthology.org/2022.coling-1.291.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>S.</given-names>
            <surname>Heindorf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Scholten</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wachsmuth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.-C.</given-names>
            <surname>Ngonga Ngomo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <article-title>CauseNet: Towards a causality graph extracted from the web</article-title>
          ,
          <source>in: Proceedings of CIKM</source>
          <year>2020</year>
          , ACM,
          <year>2020</year>
          , pp.
          <fpage>3023</fpage>
          -
          <lpage>3030</lpage>
          . doi:10.1145/3340531.3412763.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>A.</given-names>
            <surname>Gaden</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Reinhold</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zeit-Altpeter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Rausch</surname>
          </string-name>
          ,
          <article-title>Evidence retrieval for causal questions using query expansion and reranking</article-title>
          ,
          <source>in: Working Notes of CLEF</source>
          <year>2023</year>
          , CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <article-title>BERT: Pre-training of deep bidirectional transformers for language understanding</article-title>
          ,
          <source>in: Proceedings of NAACL-HLT</source>
          <year>2019</year>
          , ACL,
          <year>2019</year>
          , pp.
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          . doi:10.18653/v1/n19-1423.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>J.</given-names>
            <surname>Kiesel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <article-title>Dataset Touché23-Image-Retrieval-for-Arguments</article-title>
          ,
          <year>2023</year>
          . doi:10.5281/zenodo.7497994.
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>J.</given-names>
            <surname>Kiesel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <article-title>Dataset Touché22-Image-Retrieval-for-Arguments</article-title>
          ,
          <year>2022</year>
          . doi:10.5281/zenodo.6786948.
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>X.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>de Rijke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <article-title>Grid-based evaluation metrics for web image search</article-title>
          ,
          <source>in: Proceedings of WWW</source>
          <year>2019</year>
          , ACM,
          <year>2019</year>
          , pp.
          <fpage>2103</fpage>
          -
          <lpage>2114</lpage>
          . doi:10.1145/3308558.3313514.
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>M. L.</given-names>
            <surname>Carnot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Heinemann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Braker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Schreieder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kiesel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fröbe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <article-title>On stance detection in image retrieval for argumentation</article-title>
          ,
          <source>in: Proceedings of SIGIR</source>
          <year>2023</year>
          , ACM,
          <year>2023</year>
          . doi:10.1145/3539618.3591917.
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hallacy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Goh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sastry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mishkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Clark</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Krueger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sutskever</surname>
          </string-name>
          ,
          <article-title>Learning transferable visual models from natural language supervision</article-title>
          ,
          <source>in: Proceedings of ICML</source>
          <year>2021</year>
          , volume
          <volume>139</volume>
          <source>of Proceedings of Machine Learning Research, PMLR</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>8748</fpage>
          -
          <lpage>8763</lpage>
          . URL: https://proceedings.mlr.press/v139/radford21a.html.
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>C.</given-names>
            <surname>Fellbaum</surname>
          </string-name>
          ,
          <source>WordNet: An Electronic Lexical Database</source>
          , MIT Press,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>M.</given-names>
            <surname>Möbius</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Enderling</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bachinger</surname>
          </string-name>
          ,
          <article-title>Jean-Luc Picard at Touché 2023: Comparing image generation, stance detection and feature matching for image retrieval for arguments</article-title>
          ,
          <source>in: Working Notes of CLEF</source>
          <year>2023</year>
          , CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>R.</given-names>
            <surname>Rombach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Blattmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lorenz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Esser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ommer</surname>
          </string-name>
          ,
          <article-title>High-resolution image synthesis with latent diffusion models</article-title>
          ,
          <source>in: Proceedings of CVPR</source>
          <year>2022</year>
          , IEEE,
          <year>2022</year>
          , pp.
          <fpage>10674</fpage>
          -
          <lpage>10685</lpage>
          . doi:10.1109/CVPR52688.2022.01042.
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>D. G.</given-names>
            <surname>Lowe</surname>
          </string-name>
          ,
          <article-title>Distinctive image features from scale-invariant keypoints</article-title>
          ,
          <source>International Journal of Computer Vision</source>
          <volume>60</volume>
          (
          <year>2004</year>
          )
          <fpage>91</fpage>
          -
          <lpage>110</lpage>
          . doi:10.1023/B:VISI.0000029664.99615.94.
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>D.</given-names>
            <surname>Elagina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.-A.</given-names>
            <surname>Heizmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Koch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Lahmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ortlepp</surname>
          </string-name>
          ,
          <article-title>Neville Longbottom at Touché 2023: Image retrieval for arguments using ChatGPT, CLIP and IBM Debater</article-title>
          ,
          <source>in: Working Notes of CLEF</source>
          <year>2023</year>
          , CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>R.</given-names>
            <surname>Bar-Haim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kantor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Venezian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Katz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Slonim</surname>
          </string-name>
          ,
          <article-title>Project debater APIs: Decomposing the AI grand challenge</article-title>
          ,
          <source>in: Proceedings of EMNLP</source>
          <year>2021</year>
          , ACL,
          <year>2021</year>
          , pp.
          <fpage>267</fpage>
          -
          <lpage>274</lpage>
          . doi:10.18653/v1/2021.emnlp-demo.31.
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>V.</given-names>
            <surname>Barriere</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Balahur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ravenet</surname>
          </string-name>
          ,
          <article-title>Debating Europe: A multilingual multi-target stance</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>